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ABSTRACT 


This  report  sunaaLrizes  the  research  accomplished  under 
Grant  AF-AFCoR-62-69»  It  consists  of  two  separate  parts 
representing  two  distinct  investigations  carried  out  withiu 
the  scope  of  the  grant.  The  first  part  of  the  report  is 
concerned  with  optimization  criteria  in  systems  design;  the 
second  deals  with  the  identification  problem  for  nonlinear 
systems.  These  two  topics  correspond  to  the  abstracts 
included  in  the  Summary  Report,  USCEC  98-201,  EE-24  submitted 
in  April,  1963. 

Research  in  both  of  the  areas  described  herein  is 
continuing  under  the  renewal  Grant  AF-AFOSR-75-63.  Mr. 

R.  B.  McGhee  has  completed  a  dissertation  based  upon  further 
development  of  Part  II  of  this  report.  This  dissertation 
is  presently  being  prepared  for  journal  publication.  As 
further  significant  results  are  obtained,  they  will  also 
be  submitted  for  publication  in  the  appropriate  technical 
Journals. 
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PART  I 


OPTIH12ATION  CRITERIA 
1.  INTRCDUCTION 

There  is  a  fundamental  difference  between  information  theory  and  the 
theory  of  control.  In  information  theory,  it  is  desired  to  operate  on  an 
output  signal  so  that  as  much  information  about  the  desired  input  signal  is 
obtained  as  possible.  In  control,  the  output  is  often  asked  to  reproduce 
some  desired  signal,  usually  the  input,  as  closely  as  possible.  If  systems 
are  optimized  according  to  these  two  criteria,  the  results  are  in  general 
not  the  same.  The  difficulty  in  formulating  a  suitable  optimization 
criterion  for  systems  is  mainly  due  to  the  wide  range  of  purposes  for  which 
systems  are  used.  It  would  seem  hardly  reasonable,  therefore,  to  expsect  to 
define  a  unique  optimization  criterion  or  measure  of  quality;  at  best  one 
can  only  devise  criteria  which  are  satisfactory  over  a  wide  range  of  applica¬ 
tions.  The  most  general  description  of  the  purpxjse  of  a  system  is  perhaps 
that  it  should  give  information  about  the  input  (or  other  desired  signal). 

If  the  information  desired  is  simply  a  reproduction  of  the  input,  or  other 
desired  signal,  then  we  have  a  problem  in  control.  Sometimes,  however,  it 
may  be  required  to  extract  all  of  the  information  that  is  implicit  in  the 
output  without  regard  to  the  possible  complexity  of  the  interpretation 
process.  This  type  of  requirement  is  psarticularly  important  when  it  is 
difficult  or  impossible  to  repeat  an  experiment.  Such  cases  are  common 
in  practice,  and  under  these  circumstances  it  is  a  prime  necessity  that  the 
information  gained  be  maximum  whether  or  not  the  output  exactly  resembles 
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the  input.  In  such  a  case,  the  quality  of  the  output  is  more  important 
than  that  it  exactly  resemble  the  input.  In  such  a  case,  the  output 
Buat  be  interpreted  or  reconstructed  in  some  way.  Herein  lies  the 
difficulty  of  Instrumenting  an  appropriate  interpretive  system.  In 
other  words,  somehow  the  information  must  be  made  useful  to  the  observer. 
Such  a  procedure  is  a  type  of  decoding. 

One  of  the  purposes  of  this  research  is  to  consider  various 
optimization  criteria  so  as  to  determine  which  is  best  for  the  constrained 
system  under  consideration.  Once  a  system  has  been  optimized,  it  is  very 
desirable  to  be  able  to  evaluate  it  by  comparing  it  to  the  theoretically 
optimum  system.  In  other  words,  it  should  be  possible  to  determine  the 
optimum  system  performance  without  regard  to  any  specific  system  configura¬ 
tion.  Once  this  is  done,  a  precise  system  evaluation  can  then  be  made. 

Such  a  procedure  also  has  the  valuable  property  that  additional  research 
on  the  proposed  system  can  be  made  to  yield  the  maximum  return  for  the 
minimum  cost.  In  other  words,  the  effort  can  be  adjusted  so  as  to  take 
advantage  of  the  law  of  diminishing  returns.  If  the  proposed  system  is 
far  from  this  theoretical  optimum,  additional  effort  is  indicated,  whereas 
if  the  proposed  system  is  near  to  the  theoretical  optimum,  the  additional 
effort  may  not  be  worthwhile.  Furthermore,  the  systems  evaluation 
techniques  to  be  proposed  will  be  useful  in  the  comparison  of  systems 
on  an  absolute  basis,  so  that  if  more  than  one  system  is  under  considera¬ 
tion,  the  optimum  system  can  be  appropriately  chosen. 
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The  nost  common  optimization  criterion  uned  in  system  design 


particularly  feed  back  systems  desiftn,  is  mi.-.i.r.ization  of  meaui  square 
(or  integral  square)  error.  The  disadvantages  of  such  a  criterion  are 
well  known«  and  include:  the  requirement  of  long  time  averaging,  so  that 
the  system  is  essentially  optimized  only  for  steady  state  performances; 
excessive  emphasis  of  large  amplitude  but  short  time  deviations  of  the 
output  signal  from  the  mean  of  the  desired  signal,  as  well  as  other  dis¬ 
advantages.  The  principal  advantage  of  this  criterion  is  that  it  generally 
leads  to  matfiematically  tractable  expressions,  where  as  related  criteria 
such  as  the  minimization  of  absolute  error  do  not.  Wl.en  a  system  is  subject 
to  stochastic  disturbances  such  as  noise,  random  load  disturbances,  etc., 
a  commonly  used  criterion  of  goodness  is  the  maximization  of  signal-to- 
noise  ratio  at  certain  points  in  the  system.  Other  arbitrary  criteria 
which  have  been  used  include  specification  of  some  minimum  liklihood  of 
loss  of  track  in  tracking  systems,  minimum  probability  of  exceeding 
thresholds  on  certain  variables,  certain  statistical  detection  criteria, 
etc.  Let  us  consider  some  of  the  disadvantages  of  these  arbitrary  criteria. 
A  disadvantage  of  the  maximization  of  signal-to-noise  ratio  criterion  and 
related  criteria  is  that  no  constraint  is  put  upon  the  amount  of  time 
required  to  perform  the  desired  maximization.  For  example,  the  signal- 
to-noise  ratio  can  be  made  to  approach  infinity  by  averaging  (that  is, 
integrating)  over  an  infinite  period  of  time.  In  the  vast  majority  of 
engineering  systems,  however,  it  is  required  that  the  system  perform 
optimally  in  a  specified  period  of  time,  or,  more  commonly,  that  the  system 
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reach  its  optimuie  perfonaance  ir.  t'r.o  aii.-iir.rj/n  possible  time.  Thus  the 
question  of  a  system  bandwidth,  resr^onse  ti.r.e,  etc.,  is  equally  as 
laportant  as  maximization  of  si;jn%l-to-noise  rctio.  The  same  disadvantage 
applies  to  any  criterion  such  as  ninimization  of  mean  square  error  which 
has  no  time  constraint  included  in  it.  A  di3advanta,;;e  of  setting  thresh¬ 
olding  limits  and  other  arbitrary  criteria  which  put  bounds  upon  system 
behavior  is  that  they  are  arbitrary,  and  therefore  do  not  necessarily  lead 
to  an  optimum  system  in  the  sense  that  all  errors  are  minimized.  As  an 
example,  consider  the  typical  specifications  and  criteria  set  upon  antenna 
system  design.  It  is  common  to  specify  somewhat  arbitrarily  the  beamwidth, 
sidelobe  level,  antenna  gain,  resolution  capability,  and  other  properties 
of  the  system.  All  these  specifications  arc  actually  somewhat  crude  attempts 
to  maximize  the  rate  at  which  desired  information  is  being  acquired  by  the 
antenna  and  processed  by  the  system.  A  comron  requirement  is  that  the 
beamwidth  be  as  narrow  as  possible  for  the  specified  aperture.  In  a  search¬ 
ing  radar,  if  there  is  a  specified  maximum  search  tine,  a  narrow  beam  antenna 
must  search  at  a  high  angular  rate  in  order  to  get  adequate  coverage  during 
the  specified  frame  time.  It  could  very  well  be  that  the  rate  at  which 
information  is  being  acquired  would  be  improved  by  using  a  wider  beam 
antenna  and  searching  at  a  slower  rate.  As  another  example,  it  is  commonly 
required  that  the  sidelobes  cf  an  antenna  be  small  compared  to  the  main 
beam.  In  fact,  it  is  often  stated  that  the  ideal  auitenna  should  have  no 
sidelobes  at  all.  In  the  case  of  ground  mapping,  however,  information  is 
gained  through  the  sidelobes  ais  well  as  through  the  main  beam,  and  if  this 
information  could  be  so  processed  as  to  be  made  useful,  the  sidelobes  might 


actually  be  desirable. 

It  is  therefore  concluded  that  the  optimum  optimization  criterion  for 
any  systea  is  the  maximization  of  the  information  rate  at  the  desired  output. 
This  Is  a  most  general  statement,  but  it  is  valid,  since  it  is  always  possible 
la  principle  to  define  Just  what  information  is  desired  from  the  system.  It 
then  follows  that  that  system  which  maximizes  the  desired  information  rate 
will  be  the  optimum  one.  If  the  maximization  of  information  rate  is  to  be 
used  as  an  optimization  criterion,  the  optimization  must  take  place  subject 
to  the  constraints  which  exist  on  the  system.  The  most  common  of  such 
constraints  is  system  cost.  There  are  many  types  of  cost  functions: 
dollar  coat  is  Just  one  of  them,  although  most  cost  functions  ultimately 
have  some  sort  of  economic  basis.  Typical  cost  functions  include  size, 
weight,  power  consumption,  complexity  of  equipment,  etc.  Reallbility  is 
also  of  great  practical  importance.  In  general,  the  system  must  be  optimized 
subject  to  some  upper  bound  on  cost  and  lower  bound  on  reliability.  Since 
reliability  is  related  to  complexity  of  equipment,  it  can  also  often  be 
specified  as  a  cost  function.  Thus,  the  proposed  optimization  criterion 
will  be  the  aiaximization  of  information  rate  subject  to  some  maximum  cost 
constraint  where  the  cost  constraint  may  include  any  or  all  of  the  preceding 
specified  functions. 


2.  SYSTEMS  EVALUATION 

Note  that  the  propjosed  criterion  subject  to  constraints  is  well  suited 
to  the  comparison  of  or  evaluation  of  systems.  Clearly,  two  systems  can  be 
readily  compared  as  to  their  relative  merit,  or  a  single  system  can  be 


compared  to  the  theoretical  optiinun.  The  criterion  can  also  be  used  directly 
for  system  optimization,  except  that  when  the  systea  is  so  optimized,  the 
information  acquired  is  not  necessarily  in  its  most  usable  fora.  The 
example  of  the  information  availcbli  in  the  sidelobes  of  an  auitenna  system 
illustrates  this  possibility.  Also,  the  deternination  of  the  theoretical 
optimum  for  a  certain  class  of  systems  does  not  necessarily  mean  that  such 
a  system  can  be  instrumented.  The  classical  example  of  problems  of  this 
sort  occurs  in  the  problem  of  coding.  Shannon  has  shown  that  even  in  a 
noisy  communication  channel,  it  is  possible  to  encode  a  message  source  in 
such  a  way  that  the  information  rate  will  approach  the  channel  capacity  of 
the  system  with  the  probability  of  error  being  negligibly  small.  Thus  one 
need  only  determine  the  channel  capacity  of  the  system  exclusive  of  the 
source  to  determine  the  optimum  rate  at  which  information  can  theoretically 
be  transmitted.  However,  the  theorem  says  that  it  is  "possible  to  encode" 
so  as  to  reach  this  theoretical  maximum;  it  does  not  specify  what  this 
optimum  code  should  be.  As  a  matter  of  fact,  a  practical  code  which  even 
approaches  the  optimum  has  not  as  yet  been  devised. 

The  aforementioned  difficulties  do  not  affect  the  utility  of  the 
criterion  for  system  evaluation  anl  comparison.  when  the  criterion  is 
used  for  system  optimization,  the  requirement  of  maximum  utility  can  be 
imposed  as  a  constraint  on  the  system.  For  example,  a  constraint  which  is 
commonly  imposed  on  most  critei  ia  is  that  of  physical  reali'ability ,  since 
a  system  which  is  or,timized  without  such  a  constraint  may  end  up  by  being 
not  physically  realizable  and  hence  not  useful.  Another  way  around  this 


difficulty  is  to  specify  the  form  of  the  syste.n  to  be  used  apriori, 
loaving  only  certain  parameters  (such  as  the  coefficients  in  the  system 
differential  equation)  undetermined.  The  system  is  then  optimized  by 
■axiaizini;  the  information  rate  with  respect  to  the  variable  parameters 
aubjeet  to  whatever  constraints  are  necessary.  This  approach  also  has  the 
advantage  of  Initially  specifying  the  order  of  the  system  differential 
aquation  and  hence  the  complexity  of  the  system. 

3.  THE  INrORMATION  RATE  CRITERION 

Information  content  or  entropy  is  a  concept  that  deals  with  freedom 
of  choice.  The  greater  the  number  of  choices  available,  the  greater  the 
entropy.  In  the  case  of  a  signal  which  can  assume  only  certain  discrete 
levels,  the  entropy  per  sample  point  increases  with  the  number  of  discrete 
levels  possible.  Indeed,  if  the  probability  of  each  level  occurring  is 
equally  likely,  then  the  entropy  is  simply  the  logarithm  of  the  number  of 
possible  levels.  If  it  were  not  for  the  disturbing  influence  of  noise,  a 
continuous  signal  would  be  capable  of  being  quantized  into  an  infinite 
number  of  distinguishable  levels,  even  if  the  overall  range  in  signal 
amplitude  were  finite.  If  the  minimum  distinguishable  difference  in  levels 
is  set,  and  the  dynamic  range  is  finite,  an  upper  bound  on  the  entropy  is 
set.  The  maximum  entropy  is,  therefore,  a  function  of  the  minimum  distin¬ 
guishable  interval  where  the  latter  is  a  function  of  the  signal-to-noise 
ratio.  It  is  very  difficult  to  set  any  absolute  criterion  for  resolution. 
For  example,  consider  the  resolution  of  two  targets  by  an  antenna.  Targets 


cominonly  are  said  to  be  resolved  if  ti.ey  are  a  half-beaswidth  (or  aore) 
apart.  But  this  designation  is  jxirely  arbitrary  and  again  depends  on  the 
algnal-to-noise  ratio.  Such  a  separation  lay  be  wholly  inadequate  for  low 
signal-to-noiae  ratios;  conversely  targets  much  closer  than  this  angular 
separation  may  be  resolved  at  high  signal-to-noise  ratios.  Hather  t!;an 
attempt  to  specify  arbitrarily  some  minicum  distinguishable  interval  or 
difference  in  levels,  a  far  more  sophisticated  approach  would  be  to  specify 
information  content  in  te:ins  of  the  probability  distributions  of  signal 
and  noise  in  such  a  manner  ti.at  the  useful  total  entropy  or  information 
rate  is,  in  effect,  a  function  of  sigr.al-to-noise  ratio.  The  total  entropy 
is  related  to  not  only  the  signal-to-noise  ratio  but  also  the  bandwidth  and 
time  duration  of  the  message.  This  can  be  seen  from  Shannon  Sampling 
theorem  which  states  that  the  number  of  samples  in  the  message  is  propor¬ 
tional  to  the  bandwidth- time  product.  The  information  rate  is  t.'.e  total 
entropy  divided  by  the  time  duration  of  the  message  (provided  the  message 
is  quite  long).  Consequently  the  rate  is  proportional  to  the  bandwidth 
of  the  signal  and  the  signal-to-noise  ratio.  There  is  no  absolute  level 
of  distinguishability  in  such  a  definition.  The  information  content  and 
rate  simply  improve  as  the  signal-to-noise  ratio  improves  and/or  the  band¬ 
width  increases.  The  classical  Shannon  communication  system  is  represented 
in  Figure  1.  In  this  system  the  nature  of  the  source  is  assumed  to  be 
known  completely  (that  is,  its  f;tatictics  are  known)  .  The  encoder  includes 
the  transmitter,  the  modulator,  and  waoteve;  actual  encoding  scheme  is 

•  It  is  true  that  when  a  specific  message  is  transmitted  its  exact  form  is 
known,  however,  one  cannot  design  a  system  for  that  message  alone,  thus, 
apriori  the  nature  of  any  message  transmitted  by  a  known  source  Ccui  at  best 
be  known  statistically. 
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used  preceding  transmission.  The  link  is  the  actual  transmission  channel 
between  transmitter  and  receiver.  The  decoder  performs  the  inverse 
operation  to  that  of  the  encoder  and  includes  the  receiver  and  whatever 
demodulating  and  decoding  apparatus  is  necessary.  For  the  systems  presently 
under  consideration,  an  appropriate  representation  will  be  Figure, 2. 


Figure  1.  Conounications  System 


Figure  2.  General  System 

In  general,  signal  and  noise  may  appear  on  the  input  of  a  system,  where  the 
link  is  essentially  an  adder,  but,  different  from  the  classical  communica¬ 
tions  system  in  Figure  1,  the  noise  may  also  be  added  in  the  system  itself 

as  shown. 

5.  LINEAR  INFORMATION  THEORY 

The  information  content,  or  entropy,  of  a  random  time  function  x(t) 
with  a  (continuous)  amplitude  probability  distribution  p(x)  is  by 
definition 

r  ® 

H(x)  «  -  i  p(x)  An  p(x)  dx  (1) 

OD 


.Q- 


The  entropy  H(x)  is  the  infor.ratior  contained  ir.  a  sample  of  x  at  ti.T.e 
t  .  For  example,  the  entropy  of  a  la.-  ilan  distribution  with  variar.ce 
and  xean  zero  is  given  by 

H  -  I  «n  (2xeff^)  (2) 


This,  Incidently,  is  the  maximum  entropy  per  sample  of  all  continuous 
distributions  with  a  given  variance.  The  information  contained  in  the 
total  function  x(t)  ,  which  is  assumed  to  be  of  duration  T  ,  is  found 
from  the  joint  entropy  function 

H(Xt,x_»***,x  )  «  -  I  •••  p(x. ,x,, • • • ,x  ) 

Xc  n  j  Xt  n 

''-OD  -CD 

n 

in  p(x,  ,X3,***,x_)  It  dx.  (3) 

where  the  subscripts  denote  the  instants  of  time  at  which  the  samples  are 
taken..  If  the  time  function  has  a  finite  frequency  banJwidth  B  ,  2TB 
samples  will  completely  represent  the  time  function.  It  is  important  to 
recognize  that  a  function  of  finite  duration  cannot  have  a  Fourier  transform 
which  extends  over  a  finite  domain,  so  that  the  sampling  theorem 
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(2Bt  -  n) 


(U) 


t 

o 


(5) 


is  only  an  approximation  to  the  true  x(t5  .  This  approximation  is  very 
close  when  the  function  ar.J  its  transform  full  off  without  abrupt  discontin¬ 
uities  (e.g.,  the  Saussian  function).  equation  {U)  therefore  represents  a 

Actually  2TB  »  1  samples  are  needed,  but  2T3  «•  1  2T3  for  TB  »1  . 
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function  whose  principal  spectral  energy  lies  in  an  audio  oand  of  width 

T 

B  .  Note  that  if  x(t)  is  of  duration  T  ,  aporoximately  7“  =  23T 

to 

samples  are  needed  to  completely  define  the  function.  If  the  samples 
x(t^)  a  are  statistically  independent,  Equation  (3)  becomes 


H.,  , 

total 


n 

X.)  dx  .  * 

J  J 

In  this  case,  the  total  entropy  can  be  found  by  adding  the  entropies  at 
each  sample  point.  If  p(x)  is  stationary  in  time,  the  entropy  at  each 
sample  point  is  the  same,  ana  the  total  entropy  is  simply  the  entropy  per 
sample  times  the  total  number  of  sample  points.  The  successive  samples 
will  be  uncorrelated  if  the  spectrum  of  the  signal  is  flat  over  the 
specified  bandwidth.  Zero  correlation  implies  statistical  independence  for 
most  distributions  of  interest,  such  as  the  Saussian  distribution. 

Information  rate  is  defined  as 


I  J 


J-1 


-  CD 


H 

R  ■  lim  — 
T-w» 


total 

T 


where  total  entropy  of  the  signal 
If  the  successive  samples  are  independent  and  the 
then,  from  Equation  (6) 


(7) 

x(t)  of  duration  T  . 
process  is  stationary. 


R  -  lim  i  2T3  H(x)  -  2B  K(x) 
tu® 


(8) 
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Under  these  circumstances,  tne  information  rate  can  oe  thought  of  as  the 
entropy  per  sample  times  the  number  of  samples  per  second. 

The  "useful  entropy"  at  the  receiver  is  the  received  entropy  H(y) 
less  the  received  conditional  entropy  H(y*x)  where  the  distribution  of 
received  signals  y  depends  on  the  transmitted  si/^nal  x  .  By  definition, 

CD 

H(y'x)  ■  "  JJ  p(ylx)  dy  dx  (9) 

-  00 

where  p(x,y)  is  the  joint  distribution  of  x  and  y  and  p(ylx)  is 
the  conditional  distribution  of  y  given  x  .  The  useful  entropy  is  then 

Hy  -  H(y)  -  H(ylx)  (10) 

It  can  be  shown  that  H(ylx)  -•  0  ,  or  H(y)  when  x  is  linearly 

related  to  y  and  that  H(y|x)  -»  H(y)  or  -•  0  when  x  and  y  are 
Independent,  y  will  differ  from  x  because  of  noise  and  distortion,  so 
that  the  useful  entropy  decreases  with  increasing  noise.  The  concepts  of 
useful  entropy  and  its  time  rate  of  change,  called  "information  rate", 
are  very  important.  In  this  report,  the  system  is  said  to  be  optimized 
when  these  quantities  are  maximized  on  the  output.  It  can  be  shown  that 

H(ylx)  -  H(N)  if  X  is  the  sum  of  signal  and  noise  (x  >  S  -t-  N)  and 

the  signal  and  noise  are  independent  and  that  H(y)  s  H(x)  to  within  an 

additive  constant  if  y  and  x  are  linearly  related.  If  x  consists 

of  the  sum  of  independent  signal  S  and  noise  N  then  the  useful  entropy 
at  the  receiver  is  the  signal-plus-noise  entropy  less  the  noise  entropy, 
or 

»u  - 
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The  useful  information  rate  at  the  receiver  is  then 


R 


2B(H. 


S+N 


V 


(12) 


The  maxisiuin  rate,  or  "channel  capacity"  for  a  given  average  transmitter 
power  P-  and  noise  power  P„  is  achieved  when  both  the  signal  and 
noise  are  Gaussian  and  independent.  From  Equation  (2),  Equation  (12) 

becomes 

C  «  R  -  23  4  -fin  r2ite(P,+P„)T  -  in  (2aeP„)| 

BAX  &  V  kS  N  M  J 

-  B  /n  +  ^')  (13) 

where  "C"  la  the  "channel  capacity".  Equation  (15)  is  one  of  the  most 
widely  used  (and  misused)  formulas  in  information  theory.  It  is  strictly 
valid  only  when:  (1)  the  signal  is  Gaussian,  (2)  the  noise  is  Gaussian  and 
additive  to  the  signal,  (3)  the  noise  and  signal  are  statistically  independ¬ 
ent,  (4)  all  processes  are  stationary,  and  (5)  the  signal,  noise,  and 
signal -plus-noise  spectra  are  flat  and  limited  to  band  B.  Equation  (13) 
illustrates  one  of  the  advantages  of  the  maximization  of  information  rate 
over  the  maximization  of  signal-to-noise  ratio  and  related  criteria.  The 
rate  or  chauinel  capacity  is  clearly  proportional  to  the  signal-to-noise 
ratio  but  it  is  also  proportional  to  the  modulation  bandwidth  B.  Thus, 
maximizing  the  information  rate  maximizes  the  signal-to-noise  ratio 
subject  to  a  restraint  on  system  modulation  bandwidth  or  response  time. 


Equation  (13)  is  an  example  of  the  calculation  of  information  rate 


In  the  work  which  follows,  more  complicated  expressions  will  have  to  be 
evaluated.  The  more  complicated  expressions  arise  because  the  restrictive 
conditions  mentioned  following  Equation  (13)  ere  no  longer  satisfied.  One 
of  the  principal  complicating  effects  when  information  theory  is  applied 
to  system  design  is  that  of  filtering. 

5.  EFFHJCT  OF  FILTERING  ON  ENTROPY 

Even  if  the  time  samples  are  independent  at  the  input  to  filter, 
they  will  in  general  not  be  independent  at  the  output,  since  filtering 
produces  a  non>flat  output  spectrum.  Only  in  the  special  case  where  the 
filter  attenuation  characteristic  is  flat  will  the  output  time  samples 
remain  independent.  The  joint  entropy  on  the  output  is,  therefore,  no 
longer  the  sum  of  the  marginal  entropies,  and  the  entropy  per  degree  of 
freedom  is  difficult  to  find.  It  is  possible  to  find  the  entropy  for  a 
particular  time  sample,  and  if  the  output  process  is  stationary  (which 
it  will  be  only  in  the  steady  state)  the  entropy  at  each  sample  is  the 
sane.  The  total  entropy  is  not  the  sum  of  the  entropies  at  each  sample 
point,  however,  since  the  successive  sample  values  are  not  independent. 

As  an  example  of  such  a  calculation,  suppose  x(t)  is  Gaussian  with 
mean  zero,  and  spectral  density  and  u(f)  is  the  unit  step  function 

G^(f)  X  w^  ||u(f+B)  -  u(f-B)J  (14) 

Let  x(t)  be  the  input  to  a  linear  filter  whose  transfer  function  is 
T(f)  .  The  output  y(t)  will  also  be  Gaussian  since  a  linear  operation 
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on  a  3aussian  vHrifjble  yiel  is  a  Jijssiaji  variable.  The  output  spectral 
density  is 

GyCf)  .  3^(f)  |T(f)|^  (15) 

and  the  output  variance  is 


Equation  (19)  is  the  entropy  for  a  particular  citne  sample.  It  is  not  the 
average  entropy  per  sample  except  for  special  cases  and  t:.u3  will  not 
yield  the  total  entropy  when  multiplied  by  the  number  of  samples.  To 
find  the  average  entropy  per  sample  for  an  arbitrary  filter,  it  is  more 
convenient  to  consider  freouer.cy  ra*r.er  than  time  sampling.  The  transforma¬ 
tion  between  fre:iuency  E«nd  time  ssmrling  points  is  orthogonal,  hence  the 
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oince 


Jacobian  of  the  tranoi'orrr-a ' icn  has  the  absolute  value  of  unity, 
the  difference  between  the  entropies  of  two  variaoles  related  by  a  linear 
transformation  is  s imply  the  logarithm  of  the  absolute  value  of  the 
Jacobian,  it  follows  that  the  joint  entropy  and/or  total  entropy  of  the 
function  is  the  same  whether  expressed  in  terms  of  frequency  or  time  samples. 
The  entropy  per  degree  of  freedom  after  frequency  filtering  can,  therefore, 
be  calculated  from  the  effect  of  frequency  filters  on  frequency  rather 
than  time  samples.  This  greatly  simplifies  the  calculation  of  the  output 
entropy,  since  the  relationship  between  input  (x)  and  output  (y) 
frequency  samples  is  the  simple  product 

Y(ju))  =  T(ju)  X(jwi)  (20) 

The  frequency  samples  at  the  filter  output  are  independent  since  multiplica¬ 
tion  of  the  input  variables  by  a  nonrandom  function  doe;:  not  affect  their 
independence.  If  the  frequency  samples  are  stationary  in  frequency  as  well 
as  independent,  the  total  entropy  can  be  found  by  simply  multiplying  the 
average  entropy  per  sample  by  the  total  number  of  samples.  Although 
X(ju)  may  be  "stationary  in  frequency,"  Y(ju)  is  not  in  general.  The 
term  "stationary  in  frequency"  means  that  the  probability  distribution  of 
amplitudes  in  the  frequency  domain  is  not  a  function  of  the  origin  in 
frequency.  Equation  (26)  can  be  used  to  generate  a  set  of  linear  equations 
relating  Y^  and  X^  .  The  successive  samples  are  indepiendent ,  so  that 
the  total  or  joint  entropy  is  the  sum  of  the  entropies,  and  the  entropy 
p>er  degree  of  freedom  is  this  sum  divided  by  2TB. 


It  can  be  shown  that  the  joint  entropy  after  filterin.'^  is  related 
to  the  joint  entropy  before  filtering  by 

TB 

H(Y^,  y^,  •••  Y^)  -  H(X^.  X^,  •••.  X^)  ♦  ^  |T(f^)|^  (21) 

i.l 

Note  that  the  upper  limit  on  the  sum  is  T3  rather  than  2TB  since  of  the 
STB  independent  samples  needed  to  define  the  function,  TB  are  associated 
with  the  real  part  and  TB  with  the  imaginary  part.  Since  the  samples  are 
independent,  the  average  entropy  per  sample  is 


H(Y) 


H(Yi,  Yg,  ...,  Y.^) 

2TB 


TB 


I 


H(Y^) 

2TB 


V  -A 

L  2TB  *  2T 
i-1 


IS 

^  in  |T(f^)|^ 
i.l 


(22) 


If  TB  »  1,  the  frequency  samples  are  equally  and  closely  spaced  in  the 
band  B.  Let 


Af 


1 

f 


(25) 


TB 

H(y)  .  (H(X)  +  T  ^  in  |T(fj^)|^  Af 

i-1 


=9  H(X) 


I 


/n  |T(f)  df 


(24) 
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Equation  (2U)  actually  represents  the  average  entropy  per  degree  of 
freedom  on  the  output,  where  a  “degree  of  freedom"  refers  to  either  a 
time  sample  or  a  frequency  sample,  since  the  average  entropy  per  sample 
is  the  same  for  both  types  of  seunples.  It  is  interesting  to  compare 
Equation  (24)  with  Equation  (19).  In  general,  these  two  results  are 
different,  since  the  log  of  the  integral  is  not  equal  to  the  integral  of 
the  log.  In  the  special  case  where  the  filter  characteristic  is  constant 
across  the  band  B  the  two  are  the  same,  however. 

Note  that  the  successive  input  time  samples  are  independent  if  the 
power  spectral  density  of  x(t)  is  constant  over  B.  In  other  words, 

"  I  I^T  1^  “  *'o  '  u(f-B)]  (25) 

where  w  is  a  constant.  This  condition  is  sufficient  to  insure  that 
o 

samples  spaced  t  =  ~  apart  be  uncorrelated.  Likewise,  the  successive 
Input  frequency  samples  are  independent  if  the  power  spectral  density  of 
X(f)  is  constant  over  T  .  In  other  words, 

G^(t)  -  i  I  CX(f)]  j  ^  (26) 

If  Equation  (26)  is  satisfied  by  x(t)  ,  the  frequency  samples  spaced 

f  >  ^  apart  are  uncorrelated, 
o  T 

Let  us  now  consider  the  case  where  the  average  power  for  X(f)  per  unit 
time  interval  is  constant  but  where  the  amplitudes  of  the  frequency  spectra 
of  both  signal  and  noise  are  non-flat.  Since  successive  frequency  samples 


are  Independent  in  this  case,  but  the  process  is  not  stationary  in 
frequency,  the  total  entropy  will  be  the  sum  of  the  entropies  contrib¬ 
uted  by  infinitesimal  frequency  bands  over  the  bandwidth  B  .  If  a  non¬ 
flat  Qaussian  signal  S(f)  is  added  to  independent  non-flat  Gaussian 
noise  N(f)  ,  the  useful  entropy  at  the  output  of  the  adder  at  a  particular 
frequency  Cx(f^)  =  S(f^)  ♦  N(r )]  will  be 

Hui  -  H(X^)  -  H(Xj^ls^)  »  i  /n  ^  ^  ^  (27) 

where 


a 


2  2 
Xi'Si  = 


G  (f.)  I 
n  i  2 


(28) 


V'l'  I 


V'i’i  i 


where  the  capital  letters  refer  to  frequency  samples  and  the  lower  case 
letters  to  time  samples.  G^(f),  G^(f),  and  G^Cf)  are  the  frequency 
spectral  densities  for  the  time  functions  x(t)  =  input,  n(t)  =  noise, 
and  y(t)  =  output,  respectively.  Equation  (27)  becomes 


.  G  (f.) 

H^(X.)  =  5  in  (1  .  5^) 


(29) 


To  get  the  total  useful  entropy,  the  sum  must  be  taken  over  the  total 
number  of  frequency  samples,  which  is  2BT,  Thus, 


H  ,  .  1 

u  total 


BT 

“.t  ■  Z  i  0 

i*-BT 


G 
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Vj 
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(30) 
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Now  let  T  approach  infinity  and  if  approach  0  in  such  a  way  that 


Tif  w  1 


(31) 


Now  that  we  have  an  infinite  number  of  samples  of  infinitesimal  separation, 
the  sum  will  approach  an  integral  as  follows: 

»«t  -  i  I  G  *  gtt)) 

i»-BT  “  ^ 

T  r®  ^ 

-  2  J  0  •  Q-rn)" 

— D  n 

The  average  entropy  per  frequency  sample  is  then 

H  t  1  r  ®  A 

“  Ifl  *  4  J  „  0  *  TUTy  ^”) 

— £3  n 

Observe  that  if  the  signal  and  noise  spectral  densities  are  constants 

given  by 


G,(f)  -  S„ 

9  O 


G  (f)  w  N 
n  o 


(3^) 


then  equations  (52)  and  (35)  reduce  to,  respectively. 


“u,  ■  i  j , •  r}  «  ■  =■ (,1  •  r) 

•o  o  o 


(55) 


H  -  i  fn  fl  + 
u  2  V  N  , 
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Equations  (35^  and  (36)  are  the  results  which  would  have  been  obtained  had 
the  spectra  been  flat  to  be^in  witn.  Now  consider  the  situation  depicted 
in  Figure  3 


^s+n(*) 

I - -I - ,  So  -N, 


Figure  3.  Signal  and  Noise  Spectrum  for  Simple  System 

where  T(a)  is  a  linear  filter.  Let  the  signal,  S  ,  be  Gaussian  with 
mean  0  and  spectrum  3^  over  a  bandwidth  from  -Bg  to  Bg  and  let  the 
noise,  N,  be  Gaussian  with  spectral  density  over  the  bandwidth 

-Bjj  to  Bjj,  where 

iBgl  <  (37) 

The  total  entropy  at  x  will  be 

•  H^(S+N)  -  H^(N) 

.  BgT  tn  [2ae(a_^+ajf)'!  -  (3,^  -  Bg)  Tin  [2ne 

a„  2 

-  BjjT  Xn  [2aeOjf]  =  BgT  in  (l  +  (38) 
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where 


ffg  =  variance  of  a  3i»;nal  frequency  sample 
»  variance  of  a  noise  frequency  sample 

Notice  that  the  useful  entropy  at  x  is  exactly  the  same  as  if  B  *  B  . 

n  8 

Thusi  so  long  as  Equation  (37)  is  satisfied,  the  useful  information 
contributed  by  the  noise  alone  is  0  .  This  is  in  keeping  with  the  con¬ 
cept  that  the  noise  contributes  no  useful  information.  Now  linear 
filtering  is  a  linear  transformation  on  the  frequency  samples,  and  the 
useful  entropy  is  invariant  under  linear  transformation  ,  According  to 
Equation  (58),  the  total  entropy  is  independent  of  the  noise  bandwidth  so 
long  as  it  is  greater  than  or  equal  to  the  si.inal  bandwidth.  Thus,  if 
maximization  of  entropy  is  to  be  used  as  the  optimization  criterion  to 
determine  the  bandwidth  of  the  transfer  function  T(s)  (assumed  to  be  a 
square  filter),  there  is  no  constraint  on  its  bandwidth  except  that  it  be 
greater  than  or  equal  to  the  spectral  width  of  S.  This  conclusion  is  a 
direct  contradiction  to  the  conclusion  that  would  be  drawn  from  the  mean 
square  error  criterion,  wi.ich  says  that  the  transfer  function  spectral 
width  should  be  exactly  equal  to  Bg  so  as  to  maximize  the  siicnal-to-noise 
ratio  at  y  .  This  is  a  very  good  illustration  of  the  difference  between 
control  and  information  theoretic  concepts.  The  control  airgument  says 
make  the  average  signal-to-noise  ratio  per  frequency  sample  as  great  as 
possible  and  maximize  t};e  overall  power  siTnal-to-noise  ratio  on  the  output. 

• 

F.M.  Beza,  An  Introduction  to  Information  rneory,  Kcuraw-Hill,  1961,  p.  277. 
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The  information  theoretic  approach  says  that  it  is  unnecessary  to  do  this 
—  that  all  that  is  required  is  that  the  total  information  be  passed. 

Thus,  although  the  average  signai-to-noise  ratio  per  frequency  sample  is 
smaller  the  larger  gets  relative  to  ,  the  larger  the  number  of 

samples,  so  that  the  total  average  information  remains  the  same.  The 
fact  is  that  the  information  is  in  a  more  usable  form  if  B_  =  .  But 

the  information  is  indeed  available  if  condition  37  is  satisfied.  It 
also  specifies  what  that  maximum  availaole  useful  information  is.  It 
will  be  shown  in  later  reports  that  tne  condition  of  noise  at  the  output 
of  the  filter  does  cause  an  interaction  between  the  transfer  function 
T(b)  and  the  output,  or  receiver,  noise.  Under  these  circumstances,  the 
useful  entropy  is  indeed  a  function  of  the  transfer  function  T(s)  . 

Information  theoretic  concepts  may  be  applied  to  the  optimization 
of  feedback  and  servo  system  design.  A  general  servo  system  block  diagram 
appears  as 


Figure  V,  Typical  Servo  Block  Diagram 
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where  d(t)  is  load  disturbance,  n^(t)  is  .r.easurenent  noise  or  internal 

amplifier  noise,  n  is  noise  at  the  input  to  the  servo  and  r  is  the 

desired  input  siijnal.  r^^  =  r  +  n  is  the  total  input  signal.  c(t)  is 

the  servo  output.  Internal  sources  of  noise  such  as  d(t)  and  n^(t) 

affect  the  output  information  rate  at  c  in  such  a  way  that  the  servo 

transfer  function  is  related  to  d(t)  and  n^(t)  .  The  output  entropy 

or  rate  is  dependent  on  the  servo  transfer  function  in  this  case,  and  the 

system  may  be  optimized  by  maximizing  the  output  rate  as  a  function  of  the 

variable  parameters  in  the  servo  transfer  function.  The  open  loop  transfer 

functions  are  G  (s)  ,  the  series  compensator,  3,(s)  ,  the  olant,  and 
c  i  ‘ 

HCs)  ,  the  shunt  compensator.  In  the  case  where  d(t)  and  n^(t)  are  0, 
however,  the  output  entropy  of  c  is  independent  of  the  servo  function 
and  is  simply  equal  to  the  injxit  entropy  at  .  If  the  servo  is  to  be 
optimized  with  signal  and  noise  only  on  the  input  and  no  constraints  on 
the  system,  the  optimization  by  the  moan  square  error  criterion  is  the 
Wiener  problem.  Let  us  replace  the  servo  in  Figure  4  by  the  simpler 
structure  in  Figure  5. 


Figure  5»  Simplified  Servo  Block  Diagram 
The  error  is  defined  as 

KGr^^ 

^  -  TTkg  =  ‘'-"*^1 


(39) 
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where 
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(40) 


■  r  ♦  n 


(41) 


The  spectral  density  for  the  complex  variable  s  is  defined  as 

0(s)  .  lim  2  (42) 

•P-OB 

where  F*(s)  «  F(-s)  and  where  the  bar  or  overscore  denotes  an  ensemble 
average.  The  corresponding  spectral  density  in  the  real  frequency  variable 
(1)  is  defined  as 

a(f)  -  lim  I  lF(ju)|^  where  -  |  <  t  <  |  (43) 

T-®  -  -  c: 

From  Equation  (39))  the  spectral  density  for  e  is 


FF'0  -0  F' 


-  0  F 

rri 


(44) 


The  optimum  closed  loop  transfer  function  can  be  found  by  satisfying  the 


following  equation 

5f!-  •  Xj  . 

F0  -  0 

*-l*'l  r^r 

(45) 

where 

is  analytic 

in  the  left  half  plane  and  on 

the  imaginary  axis. 

Let 

ZZ*  «  0 
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(46) 
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r 


and 


0, 


r,r 


=  0. 


then 


(47) 


FZZ'  -  0  ^  -  X,  (48) 

rj^r  j. 

F  by  definition  must  have  poles  and  zeroes  only  in  the  left  half  plane  or 
on  the  imaginary  axis  since  it  is  to  be  a  stable  closed  loop  transfer 
function.  Z  by  definition  has  left  half  plane  poles  and  zeroes  only, 
so  that  Z*  has  right  half  plane  poles  and  zeroes  only.  0^  ^  may  have 
both  left  and  right  half  plane  poles  and  zeroes,  but  may  be  expanded  into 
a  partial  fraction  expansion  such  that 


where 


0 


rir 


(0r 


(49) 


and 


) 

++ 


has  poles  in  the  LHF  only 


rj^r  - 


has  poles  in  the  RHF  only 


Rewriting  Equation  (48)  so  that  terms  having  left  half  plane  poles  only 
are  on  the  left  half  side  of  the  equation  and  terms  having  right  half  pleme 
poles  only  are  on  the  right  half  side  of  the  equation,  we  have 


(50) 
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The  only  way  that  both  sides  of  Equation  (50)  can  be  equal  to  each  other 
is  if  they  are  equal  to  a  constant  and  it  turns  out  that  constant  nust  be 
equal  to  zero.  It  therefore  follows  ti.at  the  solution  for  F  is 


Equation  (51)  yields  the  optimum  closed  loop  transfer  function  for 
minimizing  the  mean  square  error,  and  when  this  function  in  Equation  (5I) 
is  substituted  back  into  Equation  (^4),  the  least  mean  square  error  results. 
The  condition  for  physical  realizability  of  the  trainsfer  function  has  been 
inherently  included  in  this  development. 

Unfortunately,  the  foregoing  procedure  does  not  seem  to  be  feasible 
for  the  technique  of  maximization  of  information  rate  with  respect  to  the 
variable  parameters  of  the  system.  For  one  thing,  the  variable  parameters 
of  the  system  are  unknown  in  the  Wiener  problem  and  are  specified  only 
after  the  optimisation  has  tatcen  place.  In  the  second  place,  the  information 
rate  criterion  involves  the  bandwidth  of  the  system,  in  this  case,  the  closed 
loop  bandwidth  of  the  servo,  which  is  specified  only  if  the  form  of  the 
transfer  function  is  known,  but  is  unspecified  in  the  Wiener  problem.  It 
therefore  follows  that  the  optimization  must  take  place  after  the  closed 
loop  trsmsfer  function  and  the  open  loop  transfer  function  have  been 
specified  as  to  form.  This  procedure  has  the  advantage  that  thj  complexity 
of  the  system  Ccui  be  specified  by  specifyin;  the  order  of  the  differential 
equation  describing  tlie  system  before  the  optimization  takes  place.  From 
Equation  (39),  the  mean  square  error  cam  then  be  written  as 
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-  f  G  (f)  df  (54) 

•*0 

and  this  mean  square  error  is  then  minimized  with  respect  to  the  variable 
parameters  of  the  system  by  taking  partial  derivatives  of  the  mean  square 
error  with  respect  to  those  paraureters  and  setting  them  equal  to  zero. 

This  procedure  yields  the  optimum  set  of  parameters  in  the  mean  square 
sense. 

For  the  Wiener  problem,  the  total  useful  entropy  and  useful  rate  on 
the  output  of  the  servo  are  the  same  as  that  on  the  input,  since  the  closed 
loop  servo  is  simply  the  equivalent  of  a  linear  transfer  function.  Therefore, 
this  procedure  does  not  yield  any  way  of  optimizing  the  closed  loop  transfer 
function  in  an  information  theoretic  sense.  It  will  now  be  proposed  to 
define  the  minimization  of  the  useful  entropy  of  the  error  as  the  optimiza¬ 
tion  criterion  rather  them  the  maximization  of  the  useful  entropy  of  the 
output  in  order  to  get  an  information  theoretic  criterion  which  is  closer 
in  both  form  and  concept  to  the  meam  square  error  criterion.  The  minimiza¬ 
tion  of  the  information  content  or  rate  of  the  error  criterion  is,  to  the 
best  of  our  knowledge,  completely  new.  The  following  argument  shows  that 
it  can  be  justified  on  fundamental  grounds.  The  proposed  criterion  is 
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similar  to  the  mean  square  error  criterion  and  leads  to  similar  results 
when  the  probability  distribution  is  Gaussian.  The  criterion,  which  is  the 
minimization  of  the  useful  entropy 

H^(£)  -  H(c)  -  H(cir)  »  H(r)  -  H(r|c)  (55) 

is  particularly  useful  when  used  with  nonlinear  systems,  since  a  non- 
lineeirity  results  in  a  non-Gaussian  distribution  even  when  the  input  is 
a  Gaussian  distribution.  A  probabilistic  criterion,  such  as  that  in 
Equation  (55)«  taltes  into  account  differences  between  the  signal  and  the 
noise  in  moments  higher  than  the  second,  and  thereby  produces  improved 
discrimination  over  that  which  would  be  provided  by  the  mean  square  error 
criterion  alone.  The  error  is  defined  as 

c  =  r-c  (56) 

Ideally  the  error  should  go  to  zero  so  that  r  =  c  .  The  correlation 
between  the  error  and  the  signal  input  (all  means  assumed  zero)  is  then 

er  «  r^  -  rc  (57) 

In  the  ideal  situation  where  e  =  0  , 

er»i^-r^=0  (58) 

Thus  if  c  and  r  are  ancorrelated  and  the  distribution  is  such  that 
they  are  independent. 


-29- 


H(c|r)  =  H(c) 


(59) 


and 


*  H(e)  -  H(clr)  =  K(c)  -  H(c)  -  0  (60) 

Equation  (60)  leads  one  to  the  logical  conclusion  that  the  system  will  be 
optimized  if  the  useful  entropy  of  the  error  is  minimized.  It  is  also 
Interesting  to  observe  that  the  servo  error  must  be  simply  the  noise  on 
the  input  in  the  ideal  situation  when  c  in  Equation  (56)  is  zero.  The 
"servo  error"  is  defined  as 

e»rj^-c«r*n-c  (6I) 

Ideally,  c  ■  r  -  c  »  0  and 

e  ■  n  (62) 

The  basic  conclusions  in  Equations  (5^)  through  (62)  do  not  tedte  into 

account  the  physical  realizability  constraints.  When  this  constraint  is 

taken  into  account,  the  meaui  square  error  criterion  yields  a  Wiener  filter. 

When  a  servo  is  optimized  in  the  Wiener  sense,  the  servo  error  is  white 

noise  .  This  conclusion  is  illustrated  in  the  following  example  (refer 

.2 

to  Figure  5):  Let  0  =  —g — g  and  0  *  1 

rr  c  c  nn 

w  +a 


R.  E.  Kalman,  A  N'ew  Acoroach  to  Linear  Filtering  and  Frediction 
Trans,  of  the  ASy£,  Karch  i960,  page  ‘*2,  item  m. 


Froblens , 
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*  r(r-c)tnj  [r-n]  =  n(rf.-.)  =  n  r  i-  (63) 

Thus,  the  correlation  functior.  relatin'  the  input  sijnal  plus  noise  an-i  the 
servo  error  at  the  sarre  instant  of  titr.e  is  found  to  be  si.r.ply  the  mean  noise 
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where  the  mean  of  the  noise  is  assumed  to  be  zero.  The  total  entropy  in 
Equation  (57)*  where  G  (3)  is  defined  in  Equation  (53)  and  G  1  (f)  is 

€  C  IP 

defined  in  Equation  (68),  is  then  maximized  with  respect  to  the  variable 
parameters  in  the  transfer  functions  by  taking  the  appropriate  partial 
derivatives. 

It  is  illustrative  to  illustrate  the  mean  square  error  approach  and 
the  maximization  of  the  useful  entropy  of  error  approach  with  a  simple 
example.  Let  G(s)  be  1  in  Figure  5  so  that  we  are  trying  to  optimize 
a  simple  positioning  servo.  K  is  then  the  parameter  to  be  varied  to  yield 
the  optimum.  First  consider  the  mean  square  error  approach. 


or 


F 


K 

l^K 


A 


1 

1+K 


2  .2  2  ^  Z 

e  *  A  r  +  F^n 


Z  ^  Z 
r  t-K  n 

(1-^K)2 


ae2 

nr 


(IfK)^  •  2Kn^  -  (r^tK^n^)2(l-i-K) 
(l+K)'" 


=  0 


(1+K)  Kn^  =  r^  +  K^n^ 


(69) 

(70) 


2 

K  .  (71) 

According  to  Equation  (71),  the  mean  square  error  is  minimized  when  the 
servo  gain  is  equal  to  the  power  signal  to  noise  ratio.  Let  us  now  proceed 
to  do  this  same  problem  by  the  minimization  of  total  information  or  informa¬ 
tion  rate  of  the  error  criterion. 


33- 


Let 


Z 


T 


.  .  aV  . 

(72) 

^  ■  (e  -  elr)^  ■  F^n^ 

(73) 

p 

,  9~ 

-  BT  in  Ql  «  BT  in  ( 

®elr 

^  ^  ^  ^  1 

(74) 

=1  (75) 


The  information  rate  is  then 


R 


B  in 


(76) 


Assume  that  Equation  (69)  holds  only  over  the  servo  bandwidth  and  that  the 
closed  loop  servo  bandwidth  increases  linearly  with  gain  K  .  This  is  very 
nearly  the  case  in  practice.  The  rate  in  Equation  (76)  then  becomes 


(77) 


(78) 
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Notice  the  close  similarity  between  solutions  in  Equations  (71)  and  (78). 
This  example  illustrates  the  similarity  between  the  two  approaches  and 
suggests  that  the  two  approaches  can  be  expected  to  yield  similar  results. 

In  the  forthcoming  period,  research  will  continue  along  lines  of  investiga¬ 
tion  into  various  optimization  criteria.  In  particular,  the  approaches 
Investigated  during  the  past  period  will  be  considered  further.  Both  the 
criteria  of  maiximization  of  information  rate  at  the  output  and  the  min-imiza- 
tion  of  information  rate  at  the  error  will  be  considered  for  more  complicated 
systems,  particularly  those  involving  noise  entering  at  points  other  than 
at  the  input.  So  far  only  the  unconstrained  or  Aiener  case  has  been 
considered.  Extension  of  the  information  rate  maximization  criterion  to 


constrained  cases  will  also  be  considered.  Ultimately  it  is  hoped  to 
develop  an  extensive  optimization  of  information  rate  theory  analagous  to 
that  presently  existing  for  the  oeein  square  error  criterion. 


PART  II 


ESTIMATION  AND  OPTIMIZATION  OF  NONLINEAR  DYNAMIC 
SYSTEM  PARAMETERS  BY  RE3RESSI0N  ANALYSIS  METHODS 

1.  INTRODUCTION 

Modern  theoriee  of  control  are  generally  based  upon  an  assumption 

that  an  accurate  quantitative  model  for  the  object  being  controlled  is 

available.  There  are  many  practical  situations  in  which  this  is  not  the 

caae.  In  such  circumstances,  it  becomes  necessary  to  devise  a  computational 

procedure  which  will  permit  the  inference  of  a  mathematical  description  for 

the  controlled  object  from  records  of  input  and  output.  This  problem  has 

been  attacked  with  considerable  vigor  and  substantial  success  by  a  large 

number  of  investigators  for  the  important  but  restrictive  class  of  systems 

poaseasing  the  properties  of  linearity  and  time  invariance  [l,  2,  3,  A] 

In  comparison  to  the  linear  problem,  little  progress  has  been  made 

toward  a  practical  solution  of  the  more  general  problem  of  "identification" 

of  nonlinear  dynamic  systems  While  a  very  broad  theory  for  nonlinear 

systems  has  been  advanced  by  N.  Wiener,  certain  rather  serious  computational 

difficulties  inherent  in  his  formulation  have  retarded  the  application  of 

this  theory  to  real  physical  systems  [6,  7,  8],  The  difficulties  encountered 

in  attempting  to  make  use  of  the  Wiener  theory  seem  to  arise  from  the  very 

generality  of  his  approach.  Wiener  assumes  at  the  outset  that  a  state  of 

complete  ignorance  exists  concerning  the  nature  of  the  object  to  be  identified 

This  is  rarely  the  case  when  real  physical  devices  are  under  consideration. 

In  a  great  many  situations,  basic  physical  theories  permit  the  construction 

^In  this  docuoient,  superscript  numbers  refer  to  footnotes,  while  bracketed 
numbers  correspond  to  the  list  of  references  collected  at  the  end  of  the  text 

^The  term,  "identification",  appears  to  have  been  coined  by  L.  A.  Zadeh  to 
describe  processes  which  seek  to  infer  a  mathematic  description  for  a  dynamic 
system  from  input  and  output  data  [3]. 


of  a  parametric  model  of  finite  dimensionality  for  the  system  under  Investiga¬ 
tion  and  it  is  only  the  numerical  values  of  parameters  which  remain  unluiown. 

A  simple  example  of  a  parametric  model  is  provided  by  the  equation 
for  the  motion  of  a  damped  pendulum.  Straightforward  sununation  of  moments 
yields  the  equation 

ie  .  -  Hgi  sin  e  -  BO  (1) 

which  can  be  normalized  to 

♦  Cj^O  sin  0=0  (2) 

The  determination  of  and  c^  £Llong  with  two  initial  conditions 

completely  Identifies  this  system  A  more  intriguing  example  for  which 
a  parametric  model  exists  can  be  obtained  by  considering  the  equations 
describing  the  atmospheric  re-entry  of  a  ballistic  vehicle.  In  this  case, 
nonlineair  trajectory  prediction  becomes  possible  if  the  important  vehicle 
parameters  can  be  obtained  in  a  sufficiently  short  period  of  time.  Other 
situations  in  which  parecsetric  models  exist  are  not  hard  to  imagine. 

When  an  appropriate  parametric  model  for  an  unknown  system  has  been 
established,  then,  if  the  system  is  to  be  identified  by  a  process  of  parameter 
estimation,  it  becomes  necessary  to  state  an  explicit  criterion  for  the 
adjustment  of  the  model  to  match  the  data.  A  criterion  which  has  been 
widely  utilized  in  linear  problems  is  the  integral-squared  error  function 

^  In  the  discussion  to  follow,  a  set  of  such  coefficients  and  initial 
condition^  will  be  called  a "parameter  vector"  and  will  be  denoted  by  the 

symbol •  c  . 
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evaluated  over  some  interval  of  operation.  Specifically,  let  y(t;c) 
represent  the  theoretical  or  model  response  for  a  specified  set  of  parameters, 
c  .  Then  if  the  tnie  parameter  vector  is  indicated  by  c^  ,  the  observed 
response  will  generally  consist  of  the  theoretical  response  plus  an  error 
term;  l.e.,  if  experimentally  observed  system  output,  then 

y^(t)  «  y(t;c^)  +  c(t)  (3) 


where  e(t)  represents  an  error  function.  This  error  may  be  due  either  to 
measurement  error,  noise  internally  generated  in  the  system  under  study, 
imperfections  in  the  assumed  model,  or  possibly  a  combination  of  these  and 
other  effects.  In  any  event,  the  integral  squared  error  function  is  given 

by 


T 

0(c;y^)  -  J  [y^,(t)  -  y(t;c)  J  dt 

o 

A  "least  squares"  estimate  of  the  true  parameter  vector,  c^  ,  can  be 
obtained  by  finding  a  vector,  c^  ,  such  that 


(4) 


min 

c 


(5) 


Estimation  of  nonlinear  system  parameters  according  to  this  criterion  forms 
the  basis  for  the  research  to  be  reported  in  the  following  discussion.  The 
inference  of  mathematical  models  from  experimentetl  data  based  on  such  estima¬ 
tion  will  be  referred  to  as  the  "parameter  space  method". 


Most  of  the  remainder  of  this  study  is  concerned  with  the  development 
an  explicit  computational  technique  for  the  achievement  of  the  desired  least 
squares  estimation  of  nonlinear  system  parameters.  The  particular  procedure 
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to  be  described  is  derived  from  the  methods  employed  in  classical  statistics 
for  the  estimation  of  linear  system  parameters;  i.e.,  from  the  subject  of 
"linear  regression  analysis"  [9%  However,  before  continuing  the  discussion 
of  computational  procedures,  it  is  uortr.while  to  note  that  the  problem  of 
parameter  estimation  is  very  closely  related  to  the  problem  of  parameter 
optimization.  If,  for  example,  the  desired  response  of  a  network  and  the 
form  (possibly  nonlinear)  of  the  network  are  both  specified  in  advance,  then 
adjustment  of  the  network  parameters  to  obtain  an  approximation  to  the  desired 
response  which  is  "best”  in  a  mean  square  sense  is  essentially  the  same 
problem  as  the  estimation  of  unknown  system  parameters  [lO],  While  the 
discussion  to  follow  will  use  terminology  consistent  with  perameter  estima¬ 
tion,  the  results  obtained  translate  very  simply  into  parameter  optimization 
problems. 


2.  LINEAE  REGRESSION  ANALYSIS 

2.1  Least  Squares  Estimation  and  Normal  Equations 

In  the  preceding  introduction,  a  parameter  space  characterization  of 
dynamic  systems  was  proposed.  It  was  suggested  that  a  very  general  method 
for  estimating  unknown  parameter  vectors  could  be  based  upon  the  minimization 
of  an  integral-squared  error  function  defined  over  the  parameter  space.  In 
classical  statistics,  a  very  similar  problem  is  treated  under  the  heading 
of  "regression  analysis".  In  regression  analysis  it  is  assumed  that  a  finite 
number  of  measured  values  of  a  variable,  say  ,  are  available.  It 

ia  further  assumed  that  there  are  random  errors  included  in  the  experimental 
values  for  y  ;  i.e.,  the  variable  actually  measured  is 
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^  °  •••  N  (6) 

The  objective  of  regression  analysis  is  then  to  determine  a  vectort  • 
which  eetlaates  the  true  parameter  vector,  c^,  by  making  use  of  the 
theoretical  response  function,  y(t;c)  .  This  estimation  is  conventionally 
accomplished  by  the  minimization  of  sum  squared  error;  i.e.,  if 


then 


i-1 


(7) 


(8) 


where  y^  is  an  N  dimensional  vector  composed  of  the  N  time  samples  of 
y^<t)  . 


The  minimization  of  integral  squared  error  discussed  in  the  above 
introduction  may  be  cast  in  the  form  of  regression  analysis  by  approximating 
the  Integral  by  a  sum.  That  is. 


T 

0(o;y^)  =  J  [y^(t)  -  dt 

N 

i-1 

N 


i-1 


(9) 


Since  the  N  time  samples  of  either  y^  or  y  may  be  regarded  as  N 
dimensional  column  vectors,  then,  except  for  a  scale  factor,  0  is  just 
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an  inner  product.  Because  minimization  of  0  is  unaffected  by  a  constant 
scale  factor,  it  will  be  assumed  in  all  that  follows  that  T/N  >  1  .  This 
being  the  case,  if  an  error  column  vector  is  defined  by 

•  ■  5^  -  y  (10) 

then 

!f(c;y^)  e'  e  (11) 

where  the  prime  denotes  a  transpose  •  In  the  remainder  of  this  inve5tigation» 
0  will  be  redefined  so  that  equation  (11)  is  exact;  i.e.,  least  squares 
estimation  will  be  based  upon  the  sum-squared  error  rather  than  the  integral- 
squared  error  function. 

When  the  theoretical  response  function,  y(t:c),  depends  linearly  upon 
the  parameter  vector,  o  ,  the  minimization  of  sum-squared  error  is  partic¬ 
ularly  simple.  In  such  circumstances ,  the  N  dimensional  vector  of  samples 
of  y  can  be  written 


y  »  A  ?  +  7  (12) 

o  o 

where  A  is  an  NxM  matrix  of  coefficients  and  e  is  a  random  error 
vector  The  function  0  may  be  written  in  this  case  as 

0(e;y^)  -  (y^-A7)’(y^-A7) 

-  e'e  (13) 


Note  that  there  are  two  sources  of  error  which  contribute  to  0.  There  is  i 
fira^  of  all,  the  difference  between  and  y^  which  results  from  the  fact 
c  /  Cq.  Secondly,  the  random  error,  c  ,  causes  0  to  be  greater  than  zero 

even  when  c  ■  • 

^  The  elements  of  A  are  usually  known  or  measured  values  of  independent 
variables  or  functions  of  independent  variables.  A  constant  term  may  be 
included  in  this  equation  by  choosing  one  column  of  A  such  that  every 
element  is  equal  to  one. 
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which  is  a  positive  definite  quadratic  form  in  the  coordinates  of  the  trial 
parameter  vector,  c  .  The  unique  miniBium  value  of  0  aay  be  found,  there¬ 
fore,  by  equating  to  zero  all  of  the  partial  derivatives  of  0  with  respect 
to  the  components  of  c  ;  i.e. 

0  •  -  y^jAc  -  (ac)  +  (ac)  (ac) 

-  y^y^  -  2c*A*y^  +  e  A  Ac  (14) 

ao 

^  m  -  2A  y  ♦  2a'Ac  (15) 

and 

■  -  2A  y  +  2A  Ac  =0  (16) 

ae  I  c  .  c^  °  * 

This  equation  reduces  to  the  stsuidard  "normal  equation"  for  least  squares 
estimation  [9]t 


a'ac^  X  A’y^,  (17) 

Assuming  A  is  of  full  rank,  this  relationship  may  be  inverted  to  give 

c^  -  [a'a]"^  A’y^  -  S"^  A’y^  (l8) 


The  parameter  vector  provided  by  equation  (l8)  is  thus  the  "least  squares 
estimate"  of  the  true  parameter  vector,  c 


If  the  number  of  time  samples  of  y(t)  is  exactly  equal  to  the  number  of 
unknown_^parameter3,  then  A  is  a  square  matrix.  In  this  case,  [a’A]“^a' 
i-lT 


A"^  so  c 


neous  solution  o?  linear  equations. 
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There  are  several  reasons  why  equation  (l8)  provides  a  very  desirable 
estiaate  of  .  To  begin  with,  since  the  matrix  A  is  an  array  of  known 
coefficients  (possibly  values  of  known  time  functions),  is  computed  by 

means  of  a  simple  linear  transformation  applied  to  the  data  vector,  y^  . 

The  linearity  of  this  relationship  results  solely  from  the  fact  that  the 
criterion  function,  0  ,  is  quadratic  in  the  parsuseter  vector,  c  .  If  any 
other  criterion  were  chosen,  differentiation  would  lead  to  a  nonlinear 
relationship  between  c^  and  the  data  vector.  So  far  as  the  computational 
aspects  of  parameter  estimation  are  concerned,  this  is  the  strongest  reason 
for  choosing  a  squared  error  criterion. 

From  a  statistical  point  of  view,  equation  (l8)  is  a  optimum  estimator 
in  two  senses.  First  of  all,  pzx>viding  only  that  the  random  measurement 
error,  e  ,  has  the  properties 


E(e)  -  0 

E(c  e’ )-  I  (19) 

where  I  is  a  unit  matrix,  it  follows  from  the  Gauss-Markof f  theorem  that 
equation  (l8)  provides  the  minimum  variance  unbiased  linear  estimator  of  the 
tzve  parameter  vector,  c  [9].  Secondly,  if  in  addition  to  possessing  these 
properties,  c  is  also  a  Gaussian  random  variable,  then  the  value  for  c 
specified  by  equation  (l8)  is,  moreover,  a  maximum  likelihood  estimate  of 

®o  • 

2.2  An  Example  of  Parameter  Estimation  by  Linear  Regression  Analysis 
Suppose  that  an  oscillating  system  is  known  to  be  governed  by  the 
equation 

y  ♦  ■  O  (20) 
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Th«  paraneter  vector  tor  this  equation  is 


e  - 


(21) 


nia  solution  to  equation  (20)  is  given  by 


y(ttc)B  y(0)  cos  wt 


sin  u)t 


(22) 


While  this  equation  is  nonlinear  in  u  •  it  is  linear  in  the  unknown  Initial 
conditions.  If  additional  information  exists  so  that  it  is  known  that 
M  ■  ,  then  the  coefficient  matrixi  A  ,  appearing  in  equation  (12)  is 

given  by 


cos  u.t, 
o  i. 


cos  w  t^ 
o  2 


cos  u  t 
o  n 


end  the  parameter  vector  c  is  reduced  to 


o 


(23) 


(24) 
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Since  the  sine  and  cosine  functions  are  linearly  independent,  A  is  of 
fnll  rank  (unless  all  rows  of  A  are  identical)  and  equation  (18)  may  be 
used  to  compute  estimates  of  from  sample  values  for  y  . 


2.3  Orthogonalization  in  Linear  Regression  Analysis 

When  the  columns  of  A  are  chosen  to  be  orthogonal  to  each  other, 
then  the  matrix 

S  ■  A  A  (25) 

is  a  diagonal  matrix. 


(26) 


In  such  circumstances,  equation  (17)  consists  of  an  uncoupled  set  of  linear 
equations.  If  the  columns  of  A  are  further  adjusted  by  a  change  of  scale 
so  that  they  are  orthonormal,  then  the  normal  equation  is  itself  a  solution 
for  since  S  becomei;  a  unit  matrix  sind  equation  (17)  therefore  reduces 

to 


Sc 


mi* 

Ic 

e 


e 


(27) 
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Inversion  of  S  is  thereby  avoided.  This  may  be  of  considerable  practical 
value  when  the  dimensionality  of  c  (and  consequently,  of  3)  is  large.  If 
A  is  thought  of  as  a  row  vector  with  orthonormal  column  vector  elements 


A  ■  (Aj^,  A^*  •  •  Aj^) 


(28) 


then,  from  equation  (27)  the  component  of  the  vector  c^  is  given 


^o  - 


(29) 


That  is,  the  components  of  c^  are  just  scalar  products  of  the  columns  of 
A  with  the  observed  output  variable. 

The  example  matrix  specified  by  equation  (23)  is  easily  orthogonalized 
by  a  proper  choice  of  sampling  times,  t^^  .  Unfortunately,  most  coefficient 
matrices  are  not  orthogonalized  so  simply.  However,  when  modern  digital 
computers  are  used  to  solve  the  normal  equations,  the  time  required  to  invert 

the  matrix  S  is  not  likely  to  be  significant  until  the  dimensionality  of  c 

7 

becomes  quite  large 


3.  NONLINEAR  REPRESS ION  ANALYSIS 

3.1  Nonlinear  Analysis  and  the  Need  for  Iterative  Methods 

While  linear  regression  analysis  produces  estimators  with  many  desirable 
characteristics,  the  assumption  of  linearity  with  respect  to  all  parameters  is 


7 

Exactly  what  constitutes  a  "large”  matrix  will  depend  upon  the  particular 
computer  used.  However,  matrices  up  to  tenth  order  will  certainly  be 
handled  with  ease  by  almost  any  computer. 
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scldoa  supportable  in  practical  circumstances.  Even  in  extremely  simple 
systems  such  as  the  one  exemplified  by  equation  (20),  it  will  usually  be 
found  that  the  dependence  of  solutions  on  some  parameters  will  be  very 
nonlinear.  Thus,  for  example,  if  it  is  assumed  in  equation  (20)  that  w 
is  also  to  be  estimated  from  data,  equation  (22)  shows  that  the  connection 
between  u  and  y  can  scarcely  be  approximated  by  a  linear  relationship 
(except  for  very  small  variations  in  u).  This  illustration  is  typical 
rather  than  exceptional.  Consequently,  it  is  necessary  to  consider 
minimization  of  error  functions  which  are  not  quadratic  in  the  parameter 
space. 

The  location  of  minima  for  arbitrary  surfaces  by  automatic  computa¬ 
tional  procedures  is  a  subject  which  has  received  considerable  attention  in 
recent  years.  The  mathematical  treatment  of  such  techniques  is  usually 
labeled  "mathematical  programming"  [ll].  In  the  language  of  mathematical 
programming,  the  sum-squared  error  function  of  regression  analysis  is  called 
a  "criterion"  or  "objective"  function  [11,12], 

In  problems  treated  by  mathematical  programming,  it  is  usually 
required  that  the  minimization  of  criterion  functions  be  carried  out  in 
some  bounded  subspace  of  the  parameter  space  which  is  specified  by  a  set  of 
constraint  equations.  When  both  the  criterion  function  and  the  constraint 
equations  are  linear,  the  minimum  is  unique  and  may  be  located  in  a  straight¬ 
forward  way  by  application  of  the  teclxniques  of  "linear  programming". 

Related  methods  have  been  developed  for  quadratic  criterion  functions  [ll]. 


For  general  nonlinear  problemst  onlj^  iterative  methods  exist.  Furthermore, 
there  may  well  be  more  than  one  minimum  associated  with  a  nonlinear  criterion 
function  so  that  it  becomes  necessary  to  examine  the  local  minima  to  find 
the  lowest  or  absolute  minioium.  The  research  reported  in  this  document 
deals  entirely  with  the  problem  of  determination  of  local  minima  for  non¬ 
linear  criterion  functions  in  unconstrained  parameter  spaces.  Subsequent 
reports  will  deal  with  the  solution  of  computational  problems  arising  from 
the  consideration  of  multiple  minima  and  constraint  equations. 

3.2  Paurameter  Estimation  by  Iterative  Linear  Regression  Analysis 

It  has  been  suggested,  occasionally,  that  nonlinear  parameter  estima¬ 
tion  problems  may  be  solved  by  iterative  application  of  linear  least  squares 
regression  analysis  [10,13,1**].  This  approach  differs  from  the  more  general 
techniques  of  nonlinear  programming  in  the  respect  that  it  is  restricted  to 
situations  where  the  criterion  function  is  a  sum-squared  error  function 
computed  from  data  and  the  response  of  an  assumed  parametric  model.  Wlien 
such  a  model  exists,  then  assuming  that  the  model  response,  y(t;c), 
possesses  first  partial  derivatives  with  respect  to  each  component  of  the 
parameter  vector,  c.  ,  these  derivatives  may  be  used  to  fit  an  approximating 

quadratic  surface  to  0(c;y^).  Let  X  be  the  matrix  of  partial  deriv- 

8 

atlves  given  by 


The  elements  of  X  have  sometimes  been  referred  to  as  "parameter  influence 
coefficients"  [15] •  A  computational  metnod  for  obtaining  these  coefficients 
will  be  presented  later  in  this  discussion. 
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(30) 


Then  en  approximation  to  the  response  vector  yCc^^  +  Ac)  can  be 
constructed  by  truncating  the  Taylor  series  for  y  to  obtain 


3r(Cj^  +  Ac)  ■  y(cj^)  +  XAc  (31) 

Associated  with  this  approximation  to  y  Is  an  approximate  criterion 
function: 

A  ^ 

0(^;cj,y^)  -  (y^  -  y)  (y^  -  y)  (32) 

Utilizing  equation  (10),  this  may  be  written  as 
Sf(Sc)  *'(e  -  xZc)  (e  -  XAc) 

-ee-eXAc-Ac  Xei-AcXXAc 
■  ^(Cj^;y^)  -  2Ac  X  e  -t-  ^  (33) 
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(and  therefore  solvable  by  matrix  inversion)  only  because  the  criterion 
function  is  quadratic  in  ic  . 


Since  the  validity  of  equation  (37)  rests  upon  an  assumption  of 
linearity •  there  is  no  reason,  apriori,  to  suppose  that  iterative  applica¬ 
tion  of  this  result  to  the  estimation  of  the  parameters  of  a  nonlinear 
system  will  produce  a  convergent  sequence  of  estimating  vectors.  Indeed, 
while  this  method  has  been  used  with  some  success  in  certain  problems 
[10,13]*  it  may  well  fail  to  converge  in  other  circumstances.  Necessary 
and  sufficient  conditions  for  convergence  and  other  aspects  of  stability 
are  discussed  later  in  this  report. 


3.3  Determination  of  Solution  Partial  Derivatives 

In  order  to  make  use  of  the  techniques  of  nonlinear  regression 
analysis,  it  is  necessary  to  compute  partial  derivatives  of  the  solution 
to  the  assumed  system  differential  equation  with  respect  to  each  unknown 
pairameter.  These  derivatives  car.  be  obtained  approximately  by  perturbing 
the  parameters  one  at  a  time  and  approximating  derivatives  by  finite 
differences;  i.e. 


- 


t  ••cj^)-y(t^  ;  c^,c^,'**,  c^.,' 


Ac, 


(59) 


Precise  determination  of  derivatives  in  t.'-is  manner  is  limited  by  round-off 
error  in  digital  computation  Euid  by  noise  in  analog  computation.  These 
difficulties  may  be  avoided  by  differe.ntiating  t.he  assumed  differential 
equation  with  respect  to  the  parameters  to  obtain  the  "parameter  influence 
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equations"  [15]*  The  solution  to  these  differential  equations  then  yields 
the  desired  partial  derivatives  as  functions  of  time. 


The  system  governed  by  equation  (20)  provides  a  convenient  basis 
for  the  illustration  of  the  parameter  influence  method. 


Let 


x^(t) 


Mil  .  MU 

dc^  iut 


(‘K)) 


Theni  differentiating  equation  (20)  with  respect  to  ui  : 


'  2u»y  (i>  ^  ■  0 

i  ,4.2  '  d« 

dMdt 

Assuming  continuity  of  the  appropriate  partial  derivatives, 
differentiation  can  be  reversed  yielding: 


(VI) 

the  order  of 


— ^  ^  ^  -  2u)y  (42) 

dm 

This  equation  may  be  written 


which  is  a  linear  inhomogeneous  equation  in  .  The  initial  conditions 
for  this  equation  are  obtained  from 


Xj(0) 

i^(0) 


dm 


JL 

it 


M21 

du> 


(44) 


Equations  (45)  and  (44)  taken  together  uniquely  specify  Xj^(t)  . 

Ordinairily,  the  forcing  function  y  in  equation  (43)  would  be 
obtained  by  simultaneous  computer  solution  of  equations  (43)  and  (20). 


However,  in  this  particular  case,  y  can  be  obtained  analytically  and  is 
given  by  equation  (22).  Thus,  the  equation  to  be  solved  for  is 

■  -  2uy(0)  cos  ut  -  2y(0)  sin  ut  (45) 

The  solution  to  this  differential  equation  with  the  initial  conditions  given 
by  equation  (44)  is 

Xj^(t)  ■  -  J^ty(O)  +  J  sin  lot  +  t  cos  u)t  (46) 

as  may  be  verified  by  substitution.  Since  equation  (46)  represents  the 

partial  derivatives  of  the  solution  with  respect  to  the  parameter  at 

all  values  of  time,  the  coefficients  x.,  of  the  matrix  X  are  obtained 

xl 

from  the  relation 

Xii  -  i  -  1,  2,  N  (47) 

The  validity  of  equation  (46)  may  be  ascertained  by  direct  differen¬ 
tiation  of  the  system  response  function,  equation  (20).  This,  of  course, 
represents  the  conventioned  method  for  obtaining  partial  derivatives  of 
an  analytic  function.  However,  while  this  operation  is  easier  for  a  human 
being  than  solving  a  differential  equation,  quite  the  converse  is  true  of 
an  electronic  computer.  The  purpose  of  the  parameter  influence  equations 
therefore,  is  to  provide  a  method  for  determining  partial  derivatives  which 
is  well  suited  to  computei'  rather  tl.an  human  implementation.  Further 
illustrations  of  parameter  influence  equations  and  their  computational 
solution  are  furnished  by  the  examples  provided  later  in  this  report. 
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3.^  Stability  and  Convergence  Properties  of  Causa-Mewton  Iteration 
3.4,1  Local  Stability  in  the  Absence  of  Measurement  Error 

An  Ideal  iteration  scheme  for  fitting  theoretical  response  curves 
to  data  should  not  only  be  convergent,  but  should  also  have  the  property 
that  as  a  ainimizing  value  for  the  parameter  vector  is  approached,  the 
ratio  of  the  computed  parauneter  error  to  the  true  error  tends  to  unity 
for  each  parameter.  That  is,  if  is  vector  such  that  is 

locally  minimum,  and  Ac^  is  the  parameter  change  vector  computed  at  the 
stage  of  iteration,  then  it  is  desirable  that 


(48) 


where  I  is  a  unit  matrix.  An  iteration  procedure  which  has  this  property 
will  be  termed  "asymptotically  efficient";  obviously  it  is  also  locally 
convergent.  An  equivalent  statement  to  equation  (48)  is 


^1  -  -  I  (49) 

aic  I  c  • 

When  experimental  conditions  are  such  that  the  response  function, 
y^,  can  be  measured  with  entirely  negligible  error  and,  moreover,  the 
parametric  model  for  the  process  producing  the  data  points  is  exact, 
then  will  itself  be  a  minimizing  value  for  ^  .  Moreover,  in  such 

circumstances 

e(c  )  »  y  -  y(c  )  ■  0  (50) 

o  o  o 
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so 

l?(c^;y^)  ■  e  •  ■  0  (51) 

It  turns  out  that,  under  these  conditions,  Gauss-Newton  iteration  is  not 
only  stable  at  c  «  ,  but  it  is  also  asymptotically  efficient.  More 

precisely,  if  denotes  the  parameter  change  vector  computed  at  c  ■ 

by  equation  (37),  then  the  following  theorem  states  a  set  of  conditions 
sufficient  to  ensure  the  asymptotic  efficiency  of  the  iteration: 


Theorem  1: 

Suppose  that  the  sum  squared  error  function,  0(c),  takes  on  the 
value  zero  at  c  »  c^  .  Then,  if  y(t:c)  and  its  first  partial  deriv¬ 
atives  possess  a  uniformly  convergent  Taylor  series  in  an  e-neighborhood 

ot  c  ,  it  follows  that 
o 

-  -  I  (52) 

o 

Proof: 

Let  8c  denote  the  true  parameter  error;  i.e. 

8c  -  c  -  c  (53) 

o 

Then  for  |8c|  <  e  , 


y(c) 


y^  X6c 


ifj 


dCi  9Cj 


6c .  6c  .  +  0( 6c^) 
1  J 


(5^) 


The  notation  O(ic^)  indicates  a  remainder  consisting  of  cubic  and  higher 
terms  in  ie  .  Now,  from  the  definition  of  0  ,  equation  (11) 


0  -  -  y)  (y^  -  y) 

so  substituting  from  equation  (3^) 

ft  m  Se'  x\  Cc  *  O(Sh^)  (55) 

Differentiating  this  expression,  , 

•jff  -  2x'x  Sh  +  0(«Js^)  (56) 

Now  from  equation  (37) 

<57> 

SO 

^i  "  *  ^®i  "  I  [^i^i]~^°^^i^ 

However,  under  the  continuity  and  full  rank  assumptions  regarding  X  , 

.  X  ♦  0(gh)  (59) 

so  equation  (3^)  reduces  to 

P  -  -  [^x'xj"^x'x  Sc  +  0(6c^)  (60) 


and  consequently 

si 

dc 
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5.4.2  Necessary  and  Sufficient  Conditions  for  Stability  in  the  General  Case 
In  real  problems  of  system  identification  it  is  not  to  be  expected 
that  Cf(c)  can  be  forced  to  zero  for  any  c  .  Therefore,  it  is  important 
to  determine  necessary  and  sufficient  conditions  for  the  convergence  of  the 
Oauss-Newton  procedure  in  the  general  case 

0(c;y^)  >  O  (62) 

c 

Tow&rd  this  6nd|  let  be  a  value  of  c  which  locally  ninlinizes  0  and 

let  be  the  response  vector  associated  with  this  paraaeter  vector*  Then 

a  matrix  D  f  of  second  partial  derivatives  may  be  defined  at  ?  *  c  as  ^ 

“  ■  <5.  -  ?.>■>] .  _ .  (63) 

J  c  ■ 

Another  matrix,  Q  ,  may  be  defined  from  D  as 

<3  -  Cx’x]’^  m  e  ^ 

e 

Utilizing  this  definition,  the  following  theorem  provides  the  desired 
conditions  for  local  stability  of  the  Gauss— Newton  procedure. 

Theorem  2; 

Suppose  that  y(t;c)  and  its  first  partial  derivatives  possess  a 
uniformly  convergent  Taylor  series  in  an  c-neighborhood  of  c^  in  parameter 
space.  Assume  also  that  the  matrix  X  is  of  full  rank  in  this  neighborhood. 

9 

The  elements  of  this  matrix  may  be  obtained  by  solving  the  appropriate 
parameter  influence  equations. 
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Then  if  danotes  the  trial  parameter  vector  at  the  i^^  stage  of  Iteration 

and  the  initial  parameter  vector,  •  is  chosen  so  that 

fc,  -  e|<e  ,0<6  <c  (65) 

,1  •  o  o 

it  follows  that  there  exists  an  c  for  which 

o 


11a 

i  -<D 


«i 


e 

a 


(66) 


for  any  satisfying  equation  (65)  if  and  only  if  all  of  the  eigenvalues 

of  the  matrix  Q  are  less  than  one  in  absolute  value. 

Proof; 

As  before,  let  Sc  stand  for  the  true  parameter  error  measured  from 
the  min imi zing  vector 


Then  an  equivalent  statement  to  equation  (66)  is 


(67) 


Tc.  -  3  (68) 

i-OB  ^ 


Now  for  lici  <  e 


7Cc)  ■ 


+ 


iy 

Z  L  8c.  ?c 
i.J  ^ 


6Ci  6Cj  +  O(rc^)  (69) 

c 


aOy  making  use  of  this  expansion 


-  y)  -  y) 


-  »‘'j 

i.J  ^  ^ 

*■  Se  X  Xic  ♦  O(ic^) 


Referring  to  equation  (36)  amd  notinj  also  that  (y  -  y  )  is  a  constant 

o  e 

vector  which  may  be  taken  inside  of  the  differentiation  operation,  the 
expression  for  0  reduces  to 

0  -  0(.c^)  ■*■  ^  Sc  +  Sc  (S  -  D)  Sc  O(Sc^)  (71} 


Since  c  ist  by  deflnitiont  a  ainiffluoi. 


■^(c^)  «  0 


and  0  simplifies  still  further  tc 


0  •  0(.c  )  +  Sc(3  -  D)  Sc  ♦  O(Sc^) 


Now,  utilizing  the  Gauss-Newton  iteration  equation 


where  denotes  an  evaluation  of  the  matrix  S  at  .  Because  of  the 

assumptions  regarding  y  and  X  ,  this  equation  may  be  rewritten 


-  -  I  7^  ♦  0{Sc) 


I 


Froa  equation  (73) 


If  and  only  if 


ii"*  -  C03  (83) 

This  altuatlon  in -turn  prevails^  if  and  only  if  all  eigenvalues  of  Q  are 
less  than  one  in  absolute  value.  This  completes  the  proof. 

The  following  additional  inferences  may  be  made  from  this  theorem: 
Corollary  I: 

If  D  at  [O]  ,  then  the  Qauss-Newton  procedure  not  only  converges i 
but  it  is  also  asymptotically  efficient.  This  condition  is  satisfied,  for 
example,  when  the  response  function  depends  linearily  upon  the  paurameters. 
It  is  also  satisfied  when  y^  corresponds  exactly  to  y^;  i.e.  when  no 
measurement  error  is  present  and  %  “  ® ^  •  Thus,  Theorem  1  is  a  special 
case  of  Theorem  2. 

Corollai'j  1 1 ; 

If  the  second  derivatives  of  y  are  computed  and 

<'!>>'  5^1  <«*> 

then  providing  the  second  partials  of  y  are  continuous  in  l£c|  <  c  , 
the  adjusted  Gauss-Newton  iteration  equation  given  by 

^i  ”  *i  “  ^®i  ~  °i^'^  ■^(Cj^)  (85) 

will  always  converge  locally  and  will  be  asymptotically  efficient. 
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3«^.3  Coavergenc«  with  Scale  Factor  Adjustment 

When  a  criterion  function  and  its  partial  derivatives  are  continuous 
in  an  c-neighborhood  of  a  point*  c^  ,  in  parameter  space*  then  if  70(0^^) 
differs  from  zero  it  is  always  possible  to  reduce  0(c)  by  proceeding  in 
parameter  apace  along  the  negative  gradient  for  a  sufficiently  small  distance. 
Equation  (37)  shows  that  Sauss-Newton  iteration  is  based  upon  a  linearly 
transformed  step  along  the  gradient.  This  linear  transformation  will,  in 
general,  induce  both  a  magnitude  and  an  angle  change  in  70  and  (as  shown 
by  the  preceding  theorem)  may  well  fail  to  produce  a  convergent  estimator 
of  the  true  parameter  vector,  c^  .  It  is  relevant,  therefore,  to  consider 
the  possibility  of  stabilizing  3auss-Newton  iteration  in  such  circumstances 
by  reducing  by  an  appropriate  scale  factor  whenever  the  basic  procedure 
fails  to  converge.  The  crucial  question  involved  in  the  implementation  of 
such  a  scheme  relates  to  the  angle  which  exists  between  the  gradient  vector 
and  the  Gauss-Newton  correction  vector,  .  Specifically,  since  the  gradient 

vector  is  normal  to  the  contour  lines  of  0  ,  motion  along  ^  for  a  suffi¬ 
ciently  small  distance  will  result  in  a  decrease  in  0  providing  only  that 
the  angle  between  ^  and  70  is  obtuse;  i.e.*  providing  that  the  scalar 
product  of  p  and  70  is  negative.  The  following  lemma  expresses  this 
statement  more  precisely  and  shows  that  the  angle  condition  is  always 
satisfied. 

Lemma  1; 

Consider  a  region,  R,  in  parameter  space  such  that  the  matrix  X 
is  of  full  rank  everywhere  in  R  .  Assume,  furthermore,  that  about  every 
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point  in  R  there  exists  an  e-neighborhood  in  which  a  uniformly  convergent 
Taylor  series  nay  be  constructed  for  the  criterion  function,  0(c)  .  Then  for 
any  point  in  R  such  that  ^(c)  /  0 

~  0(c  +  Ttp)  I  -  ^’|  <  0 

»  T1  -  O 

Proof; 

Locally,  0  is  represented  at  any  point  in  R  by 
0(c  +  Til)  -  0(c)  +  ^’tiI  +  0(T,^) 

80 

+  0(T1)  (88) 

and 

-  70  P  -  P  70 

T)  -  O 

Now  from  equation  (37) 

ifl  m  ~  2SP  .  -  2x'xp 

BO 

-  -  al'x'xp  -  -  2(x'?)'(x|)  =  -  2  Ixpl^  <  O  (91) 

However,  since  X  is  of  full  rank,  |X(3|  0  0  unless  3  >  0  .  Furthermore, 
since  70  /  O  ,  equation  (90)  shows  that  3/0  so  the  inequality  in  equation 


(89) 


(90) 


(86) 


(87) 
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(91)  mist  be  strict;  l.e 


-  -  2  !x|»^  <  0 

This  completes  the  proof. 


(92) 


Levaa  1  provides  the  basis  for  the  following  basic  convergence  theorem 
for  the  modified  Qauss-Newton  procedure 


Theorea  3; 

Assuae  that  the  conditions  of  Lemma  1  are  satisfied.  Suppose  that  the 
region  R  la  in  addition  convex  and  has  the  further  properties  that  the 
criterion  function,  0(c)  ,  is  concave  in  R  and  possesses  a  minimum  at  an 
interior  point,  c^  ,  of  R  .  Let  0  be  the  infimum  of  0  assumed  on  the 
boundary  of  H  and  lot  denote  the  parameter  vector  obtained  at  the 

i***  stage  of  iteration.  Then  for  any  initial  point,  c^^  ,  interior  to  R 
such  that 


0(c^)  <  0 


it  is  possible  to  choose  a  sequence  of  points 


®i+l  -  «  ♦  \  Pi  *  0  <  <  1 


such  that 


liB  -•  -• 

4  _  c,  ■  c 
i—  OD  i  e 


(93) 


(94) 


(95) 


A  related  theorem  involving  more  restrictions  on  R  has  been  proved  by 
H.  0.  Hartley  [14].  Hartley  refers  to  the  Gauss-Newton  iteration  procedure 
with  scale  factor  adjustment  as  the  "modified  Qauss-Newton  Method". 
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t 


Proof; 

Fro*  equation  (87)t  0  is  represented  locally  at  every  point  in  R 

by 

0(e  +  TIP)  -  0(c)  +  >  0{T)^)  (96) 


Now  due  to  the  concavity  of  0  •  if  ^  •  ^ben  *70/0  and 

p'^(cj^)  -  <  0  (97) 

80  there  exists  an  T]^  at  every  stage  of  iteration  such  that  when  c^^^ 
is  chosen  according  to  equation  (9^)«  then  is  interior  to  R  and 

0(Ci^j^)  -  0(c^  +  \Pi)  <  0(c^)  (98) 

Since  0  is  bounded  from  below  by  0(c^)  and  0(c^)  forms  a  monotone 
decreasing  sequence  for  the  c^^  so  chosen,  it  follows  that 

0Cc, )  .  0(S  )  (99) 

i-OD  ^  ® 

The  concavity  of  0  in  R  then  guarantees  that  the  sequence  in  parameter 
space  also  converges;  i.e. 


lim 

i-tflB 


(100) 


3>4.4  Discussion  and  Conclusions 

Theorem  1  can  be  utilized  in  circumstances  where  no  real  physical 
measurements  are  to  be  made.  It  is  applicable,  for  example,  in  situations 
where  a  differential  equation  is  to  be  determined  whose  solution  is  a  given 
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Analytic  or  conputed  fanction  [10 J.  Since  this  theorem  relates  to  local 
stability  only,  it  may  be  necessary  to  employ  a  scale  factor  reducing 
algorithm  in  the  Initial  stages  of  an  iteration  in  order  to  achieve  a 
convergent  sequence  of  pareueeter  vectors.  The  existence  of  such  an  algo¬ 
rithm  is  guaranteed  by  Theorem  3.  It  does  not  appear  that  Theorem  1  is 
of  any  significance  in  pareuneter  estimation  problems  involving  real  physical 
measurements  since  errors  are  inevitably  present  in  such  data. 

Theorem  2  provides  necessary  and  sufficient  conditions  for  the  Gauss- 
Newton  Iteration  to  be  locally  stable  without  scale  factor  adjustment  in  the 
case  where  the  minimum  of  0  is  greater  than  zero.  As  a  practical  computa¬ 
tional  matter,  it  is  unlikely  that  any  eigenvalues  of  the  matrix  Q  will  be 
close  to  unity.  More  likely,  either  all  values  will  be  considerably  less 
than  one  or  else  at  least  one  will  be  considerably  larger  than  one.  Thus 
one  would  expect  that  unmodified  Gauss-Newton  iteration  will  either  converge 
very  rapidly  in  the  later  stages  of  iteration  or  else  it  will  sharply  diverge. 
In  any  event,  computation  of  D  for  the  purpose  of  establishing  convergence 
does  not  seem  worthwhile.  The  number  of  differential  equations  to  be  solved 
to  obtain  the  elements  of  D  increases  alarmingly  with  an  increase  in  the 
dimensionality  of  c  .  It  is  much  easier  to  simply  compute  0Cc^  -t-  3^)  at 
each  step  to  see  if  an  improvement  results  than  to  investigate  stability  by 
determining  D  and  its  eigenvalues. 

The  above  remarks  also  apply  to  the  modified  Gauss-Newton  equation  of 
Corollary  II.  A  simple  search  along  the  vector  P  may  be  carried  out 

with  less  labor  than  is  involved  in  evaluating  D  at  each  st^Lge. 
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The  net  conclusions  to  be  drawn  from  the  stability  theorems 
developed  in  this  section  may  be  stated  in  tens  of  the  following 
computation  policy: 

1.  If  the  Gauss-Newton  parameter  change  vectori  «  provides  a 

reduction  in  the  sum-squared  error  functions,  0(c;y^)  ,  at  the  i-th 
stage  of  iteration,  use  “  ®i  ^i  * 

2.  If  this  is  not  so,  then  determine  searching  along 

for  a  locally  minimum  value  of  0  . 

The  implementation  of  this  policy  in  terms  of  a  complete  computational 
algorithm  for  unconstrained  nonlinear  regression  analysis  is  presented  later 
in  this  report. 


4.  STATISTICAL  PROPERTIES  OF  NONLINEAR 
LEAST  S^JJARHS  PARAMETER  ESTIMATES 

4.1  Linearized  Estimation  of  the  Variability  of  Parameter  Estimates 

The  preceding  investigation  of  the  stability  properties  of  iterated 
linear  least  squaures  regression  analysis  shows  that  it  is  possible  to  devise 
a  computationaLl  algorithm  which  will  ensure  convergence  of  a  sequence  of 
parameter  estimating  vectors  to  a  vector,  c  ,  which  locally  minimizes  the 
sum-squared  error  function,  0{c')  .  This  vector  may  fail  to  accurately 
estimate  the  true  parameter  vector,  c^  ,  for  two  reasons.  First  of  all, 
it  is  possible  that  the  iteration  process  will  converge  to  a  local  minimum 
of  0(c)  which  is  not  the  minimum  associated  with  c  «  c^  .  The  detection 
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and  correction  of  this  type  of  error  is  a  very  difficult  problem  whose 
treatment  has  been  reserved  for  a  later  report.  Secondly,  even  if  the 
coapitational  procedure  converges  to  the  proper  minimum  of  0(c),  measure¬ 
ment  and  other  errors  will  prevent  computer  generated  solutions  from  fitting 
the  data  exactly.  Consequently,  there  will  remain  a  residual  statistical 
error  in  the  estimation  of  the  unknown  system  parameters.  Estimation  of  the 
magnitude  of  this  type  of  error  by  analytic  procedures  is  extremely  difficult 
when  the  systems  involved  are  nonlinear  auid  the  errors  are  large.  Quite  often, 
error  estimates  can  be  obtained  only  by  recourse  to  statistical  methods 
involving  many  repetitions  of  the  same  physical  experiment.  However,  if  it 
can  be  assumed  that  c^  is  quite  close  to  ,  then  an  asymptotic  error 
estimate  may  be  obtained  by  linearizing  the  relationship  between  measurement 
errors  and  estimation  errors. 

In  order  to  obtain  the  desired  error  estimate,  assume  that  y(?  ) 

e 

can  be  represented  by  a  uniformly  convergent  Taylor  series  expansion  about 
c^  .  Denote  the  difference  between  the  estimated  and  true  parameter  vector 
by  ic  ;  l.e. 


Sc  m  c  -  c 
e  o 


(101) 


Then 


y(c.) 


-  y(c^)  >  X(c^)  Tc  I  ^ 

i.J  ^  J 


6c 6Cj  +  O(Sc^)  (102) 


c  *  c 

o 
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Now,  the  data  vector,  ,  is  given  by  equation  (3)  an  y^  « 
so  the  residual  error  vector,  e  ,  may  be  written 

•  -  -  y  -  ^  -  xrc  -  I  6c^  6cj  +  odch 

itj  ^ 

and 

-  26c  X  c  6c  X  X6c 

“  I  Gsrt  ‘*=1 

i.J  ^ 

■  £  e  -  2ic  X  e  *  ic  (S-D)  5c  +  0(5c^)  (104) 

where  S  and  D  have  the  same  significance  as  before.  Assuming  that  term- 
wise  differentiation  of  the  series  for  0  is  permissible, 

^  ^  .  _A  ,  .  2x’e  +  2(3-D)rc  +  0(5*c^)  (105) 

dc  d(fic) 

Since  c  is  a  minimizing  value  for  0  ,  this  expression  must  equal  zero 
and  therefore,  to  a  first  order  of  approximation 

x’c  «  (S  -  D)  6c  (106) 

or 

5c  *  (S  -  D)"^  x'e 

Typically,  will  be  a  random  variable  with  zero  mean.  In  such 

circumstances 


+  c 


(103) 
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E(4c)  -  (S-D)"^  x’E(e)  «  (3-D)"^  x'o  .  0  (107) 

and  the  covariance  matrix  for  6c  is  given  by 

Ccov  6c]  «  E{^  6c‘}  «  E{(S-D)"^  X^C(S-D)"^  X^c]'  } 

-  (S-D)"^  x'[cov  e]  X  (S-D)"^  (108) 

It  is  a  common  occurrence  in  experimental  situations  to  find  that 

Ccov  ?]  =  a^l  (109) 

When  this  is  the  case 

Ccov  dc]  .  O^(S-D)"^  x'x(S-D)"^ 

«  (S-D)"^  S(S-D)*^  (110) 

If,  furthermore,  D  «  [O]  ,  (as  in  linear  regression  analysis,  for  example), 
then  the  covariance  matrix  reduces  still  further  to 

[coy  6c]  »  =  o^S“^  (111) 

which  is  a  standard  result  in  linear  statistical  theory  [l6]. 

4,2  Kelationship  of  Length  of  Record  to  Parameter  Estimation  Accuracy 

It  is  a  general  characteristic  of  estimation  problems  that  accuracy 
nay  be  increased  by  considering  more  data  points.  When  the  estimation 
problem  involves  determination  of  dynamic  system  parameters,  this  neains 
that  it  should  be  possible  to  increase  the  reliability  of  parameter  estimates 
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by  observing  the  system  for  a  longer  period  of  time.  In  situations  where 
the  system  under  observation  is  known  to  be  linear  in  the  unknown  parameters, 
it  is  possible  to  obtain  quantitative  relationships  linking  estimation 
accuracy  to  the  amount  of  data  available  in  a  relatively  straightforward 
fashion  However*  when  the  system  is  nonlinear*  determination  of  such 

relationships  becomes  very  difficult. 

While  the  relationship  between  observation  time  and  estimation 
accuracy  may  very  well  be  unattainable  by  analytic  methods  in  the  general 
nonlinear  caise*  under  certain  simplifying  assumptions*  an  asymptotic  error 
theory  can  be  constructed.  In  order  to  arrive  at  such  a  theory,  it  is 
necessary  to  modify  the  parameter  estimation  equation  slightly  to  avoid 
the  possibility  of  the  S  matrix  becoming  singular  in  the  limit.  The 
estimation  formula  given  by  equation  (37)  ®ay  be  rewritten  as 

p  -  i  NS‘^  ’  I  °  I  ’  N 

auid  in  a  similar  fashion*  equation  (110)  cam  be  expressed  as 

Ccov  fic]  -  ^  -  f)-"  I  (|  -  f)'"  (113) 

With  the  covariance  matrix  for  Ac  written  in  this  form*  it  becomes 
possible  to  consider  its  behavior  as  the  number  of  saunples,  N  ,  approaches 
infinity. 

Assume  for  the  moment  that  the  matrices  S/N  and  D/N  have  limiting 
values  as  N  approaches  infinity.  Specifically  let 


11a 
I*-  ® 


S 

N 


V 


11a 

1^»aa 


D 

N 


D 
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Thatti  providing  that  the  limiting  operations  can  be  interchanged  with  the 
ailgebraic  operations •  it  follows  that 


NCcov  6c]  -  [V  -  U]‘^VCV  -  U]”^  (115) 

N  -*OD 


This  equation  reveals  that  the  magnitude  of  the  elements  of  the  covariance 
matrix  vary  inversely  with  the  number  of  samples;  this  is  a  familiar  fact 
In  linear  statistics  [18]. 

The  existence  of  the  assumed  limit  matrices  may  be  related  to  the 
behavior  of  the  solution  partial  derivatives.  Referring  to  the  definition 
of  the  matrix  of  partial  derivatives,  X  ,  (equation  (JO))  it  can  be  seen 
that  this  matrix  may  be  written  as  a  row  vector 


X  -  (Xj^,  X2, 


(116) 


whose  elements  are  the  column  vectors,  X^^  .  Each  X^^  is  in  turn  given  by 


3y(tj^) 

^1 

ayCt^) 

ac. 


5y(t„) 


(117) 
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Since  the  matrix  S  is  defined  as  S  ■  X  X,  this  notation  peinits  the 
elements t  f  of  S  to  be  written  as 


-  h 


(118) 


That  is,  the  elements  of  S  are  just  the  scalar  products  of  the  appropriate 
columns  of  the  matrix  X  .  The  elements  of  the  normalized  matrix,  V  ,  are 
in  turn  given  by 


11m  X^ 

’^ij  ■  N-a>  N 


(119) 


When  i  >  J  I  this  expression  reduces  to 


ii  H- 


N  2 


i.e.  is  just  the  mean  square  value  of  the  partial  derivative 

Sy/^Cj^  .  Consequently 

>  0  (121) 

when  it  exists.  Moreover,  unless  the  partial  derivative,  ,  tends 

toward  zero  as  t  becomes  arbitrarily  large,  the  inequality  is  strict;  i.e. 


Xii>o 


(122) 


When  1  /  j  ,  the  situation  is  somewhat  different.  Since  the  partial 


derivatives  involved  may  be  either  positive  or  negative,  the  terms  in  the  sum 


“  N-OB  L  N 


(123) 
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If  th*  rectors 
every  1  and  J 
equation  (113) 


y  -  y  and  ^  y/^c.  ^c.  are  likewise  uncorrelated  for 
o  a  i  j 

,  then  U  must  tend  to  a  zero  matrix.  When  this  occurs i 
reduces  to 


lia 

N->aD 


NCcov  Sc]  ■ 


OP 


-  ir% 


il 


and 


r(6Cj^  6Cj}  -  0  ,  1  /  j 


(126) 

(127) 

(128) 


5.  AN  AUTOMATIC  COMPUTATIONAL  AL30SITHM  FOP  PAPAMETER 
IDENTIFICATION  BY  NONLINEAR  REOREoSION  ANALYSIS 


3.1  Stopping  Pules 

To  provide  a  fully  automatic  basis  for  the  determination  of  minima  of 
a  criterion  function,  it  is  necessary  to  provide  a  stopping  rule  for  the 
iteration  sequence,  •  That  is,  the  iteration  cycle  must  be  subjected 

to  a  supervisory  procedure  which  determines  that  further  refinement  of  the 
estimate  of  the  true  pareuneter  vector,  c^  ,  is  not  justified.  Such  a 
stopping  rule  may  be  based,  for  example,  on  a  percentage  change  in  either 
the  parameter  space  or  the  criterion  function  itself.  Let 


d 

c 


(129) 


and 


l>(c^)  -  0(c,^,) 


(130) 
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be  the  aormallzed  differences  between  two  successive  computations  measured 
in  terms  of  the  parameter  estimates  and  criterion  function  respectively. 

Then  an  appropriate  stopping  rule  is: 

1.  Stop  if  d  <  e 

c  c 

2.  Stop  If  djj  < 

The  choice  of  the  values  of  c  and  must  be  made  in  advcince  by  the 

c  0  ^ 

human  experimenter.  Since  d  and  d-  may  behave  quite  differently,  the 

c  ^ 

Initial  values  for  these  two  quantities  may  require  adjustment  during  the 
course  of  an  experiment.  While  this  particular  stopping  rule  is  somewhat 
arbitrary,  it  is  essential  in  every  case  that  some  criterion  for  halting 
computation  be  provided  since  otherwise  the  computer  will  continue  to 
calculate  new  values  of  c  indefinitely  even  though  the  practical  value 
of  greater  precision  may  be  entirely  negligible. 

5.2  Stabilization  of  the  Iteration  by  Binary  Search 

In  the  discussion  of  the  significance  of  the  convergence  theorems,  a 
specific  computational  policy  was  proposed.  This  policy  involved  stabiliza¬ 
tion  of  the  iteration  procedure  by  superimposing  a  scale  factor  search  routine 
onto  the  basic  Gauss-Newton  iteration  scheme.  A  very  simple  type  of  search 
may  be  obtained  by  simply  reducing  the  parameter  change  vector  by  a  factor 
of  two  repeatedly  until  a  value  of  the  criterion  function  is  obtained  such 
that : 

1.  0  is  less  than  the  starting  value. 

2.  0  is  locally  minimized  (with  respect  to  the  binary  search  along  p). 


-76- 


Flgun  1  is  a  flow  diagram  for  a  computer  program  to  achieve  the 
desired  stabilization.  This  diagram  also  includes  a  "stability"  test  for 
the  function  y(t;?)  for  the  purpose  of  assuring  that  the  magnitude  of 
the  computed  variables  does  not  exceed  the  range  of  the  computer  arithmetic 
registers.  The  equations  written  in  the  boxes  of  Figure  1  represent 
difference  or  "up-dating"  relationships  as  is  conventional  in  such  diagrams. 
The  sequence  of  operations  is  always  to  be  t^0<en  from  top  to  bottom  in  each 
box. 

It  la  certainly  possible  to  devise  more  sophisticated  search  procedures 
than  the  one  shown  on  Figure  1.  More  efficient  techniques  are  currently 
baing  developed  and  will  provide  the  basis  for  future  reports.  However, 
binary  search  has  been  found  to  be  entirely  satisfactory  for  the  purpose 
of  experimentally  verifying  the  theoretical  results  obtained  in  this  study. 

5.3  Flow  Diagram  for  the  Complete  Algorithm 

All  of  the  preceding  discussion  is  summeLrlzed  in  the  flow  diagram 
displayed  as  Figure  2.  This  diagram  has  been  coded  in  the  Fortran  compiling 
language  and  utilized  to  produce  the  experimental  results  presented  later  in 
this  report.  The  numbers  appearing  at  various  pxjints  of  Figures  1  and  2 
refer  to  statement  numbers  from  the  Fortran  program.  While  the  program 
itself  has  not  been  Included  as  part  of  this  document,  copies  of  both  the 
Fortran  decks  aind  a  listing  of  these  decks  are  available  upon  request. 


FIGURE  2  Fl-OW  DIAGRAM  FOR  PARAMETER  ESTIMATION 
BY  NONLINEAR  REGRESSION  ANALYSIS. 


Equations  for  the  Inference  of  Pendulum  Parameters 


The  experiaenteil  part  of  this  study  is  concerned  with  the  estlnation 
of  parameters  for  the  pendulum  equation  (2) 

c^O  *  sin  0  a  0 

In  conformity  with  the  notation  established  for  regression  analysis i  this 
nay  be  rewritten  as 


c^*  Cj^y  ♦  sin  y  «  0 

The  parameter  vectort  ?  ,  is  therefore  given  by 


(131) 


(132) 


and  elements  of  the  parameter  influence  matrix,  X  ,  consist  of  the  psirtlal 
derivatives 

97(1,) 

These  partial  derivatives  are  obtainable  from  the  appropriate  parameter 
influence  equations.  For  example,  differentiating  equation  (131)  with 
respect  to  yields 


(c^*  ♦  Cj^y  +  sin  y) 


O 


(133) 
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or 


c-  "■  ^  3  +  c.  — 5-2—  4.  ^  4.  cos  jr  ■  o 

acj^  at*^  acj^  at  at  ac^^ 


(13'*) 


Assuming  that  the  order  of  differentiation  may  be  interchanged,  this  last 
equation  may  be  written 


*2*]^  ♦  Cj^Xj^  +  Xj^  cos  y  ■  -  y 


(135) 


Repeating  these  steps  with  respect  to  each  component  of  c  results  in  the 
set  of  equations 


€2X2  +  ®i’'2  *2  ^ 


•  -X,  +  c,x_  x_  cos  y  «  O 
<=3  •‘■3  3 


^  ®1*4  ■*•  *4  cos  y  ■  0 


(136) 

(137) 

(138) 


The  initial  conditions  necessary  for  the  solution  of  these  equations  are 
given  by 


,^(0)  .  ^  (0)  -  0  ;  ^^(0)  -  (0)  -  0 

*3^°^  “  3^  (0)  -  1  ;  ij(0)  -  (0)  -  0 

X.  (0)  -  (0)  -  O  ;  i,  (0)  -  (0)  -  1 


(139) 

(140) 

(141) 

(142) 


ay(o)  ’  ^y(o) 

This  completes  the  set  of  equations  required  for  regression  analysis.  The 
matrix  X  may  be  calculated  for  an  initial  parameter  vector  estimate, 


C  -  C ,  by  simultaneously  solving  equation  (I3I)  with  equations  (135) 
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thru  (13d}  subject  to  the  above  initial  conditions,  (^antitative  results 
obtained  by  carrying  out  this  procedure  are  presented  in  the  next  section 
of  this  report. 


6.  EXFESIMQtTAL  RESULTS 


6.1  Teralnal  Convergence  Without  Keasurenent  Noise 

la  order  to  illustrate  the  behavior  predicted  by  Theorem  1,  the 
program  Illustrated  in  Figure  2  was  executed  using  noise-free  data 
generated  by  the  computer.  The  data  was  produced  by  numerical  solution 
of  the  pendulum  equation  using  the  parameter  vector 


(143) 


The  resulting  response  function,  y(t;c^),  is  shown  on  Figure  3  an 
the  curve  labeled  "true  response".  The  data  vector,  y^,  required 
for  parameter  estimation  was  obtained  by  using  the  ten  equally  spaced 
samples  of  y(t;c^}  shown  on  the  figure  as  "accurate  data  points". 

nie  initial  guess  for  the  parameter  vector  used  to  start  the 
iteration  was  taken  as 


(144) 
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FIGURE  3  RESPONSE  CURVE  FITTED  TO  NOISY  DATA  BY 
THE  COMPUTER. 
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These  values  were  deliberately  chosen  to  be  very  close  to  the  true 
paraaeters  so  that  only  the  terminal  convergence  properties  of  unmodified 
Gauss— Newton  iteration  would  be  displayed.  The  stopping  criterion  used 
for  this  experiment  was 

«  e  .  10"^ 

0  c 

This  very  stringent  criterion  was  satisfied  after  only  two  unmodified 
Gauss-Newton  iteration  cycles.  The  sum-squared  error  was  reduced  from 
a  value  of  I.85  X  10  ^  to  a  value  of  I.09  x  10  by  the  two  steps  taken 
in  parameter  space.  Table  1  shows  the  sequence  of  values  generated  by 
the  computer  for  the  parameter  vector,  c  .  These  results  strongly 
support  the  conclusions  of  Theorem  1. 


TABLE  1 

RESULTS  OF  GAUSS-NEWTON  ITERATION  USING  EXACT  DATA 


6«2  Stabilization  of  Gauss-Newton  Iteration  by  Binary  Scale  Factor 

Adjustnent 

According  to  Lemma  1,  binary  scale  factor  adjustment  should  always 
produce  a  monotonically  decreasing  sequence  of  values  for  the  sum-squared 
error  function,  0  .  To  demonstrate  this  property,  a  computer  experiment 
was  conducted  utilizing  the  modified  Gauss-Newton  iteration  technique 
illustrated  in  Figure  2.  The  "accurate  data"  vector  computed  to  produce 
Table  1  was  also  used  in  this  experiment.  However,  the  parameter  vector 
selected  as  the  starting  point  for  the  iteration  was  intentionally  chosen 
to  be  quite  far  from  the  true  parameter  vector  so  that  searching  would  be 
required. 

Table  2  summarizes  the  results  obtained  from  60  iteration  cycles. 

The  variable  N  appearing  in  this  table  refers  to  the  number  of  times 
the  computer  reduced  the  parameter  change  vector,  ^  ,  by  a  factor  of  two 
while  searching  for  a  local  minimum  within  each  cycle.  Large  values  for 
N  Indicate  that  the  unmodified  Gauss-Newton  procedure  is  very  unstable 
at  that  point. 

This  example  agrees  with  theory  in  the  respect  that  a  lower  value 
for  was  obtained  at  every  stage.  It  also  shows  that  while  convergence 
does  occur  in  the  sequence  of  values  for  ^  ,  successive  values  for  c 
may  vary  quite  erratically.  It  must  be  emphasized  however,  that  the 
starting  value  for  ?  which  produced  this  data  was  grossly  in  error 
and  was  deliberately  chosen  to  illustrate  this  type  of  behavior.  Later 
examples  will  show  that  the  modified  Gauss-Newton  procedure  works  quite 


TABLE  2 


SEQDENCE  OF  PARAMETER  VALUES  OBTAINED  USING  G 

ITERATION  STABILIZED  BY  BINARY  SCALE  FACTOR 

AUSS -NEWTON 

ADJUSTMENT 

Iteration  c. 

Nuaber 

®2 

•=3 

®4 

0 

N 

0 

1.000 

2.000 

3.0000 

1.000 

360.3 

1 

0.093 

-  5.188 

4.582 

1.241 

212  A 

3 

2 

0.129 

-  9.877 

3.643 

-  1.141 

163.8 

3 

3 

0.329 

-  73.41 

0.4556 

-  0.1078 

2.422 

0 

4 

-  0.948 

-  19.96 

0.4733 

-  0.1223 

1.819 

8 

5 

-  0.449 

-  16.53 

0.4784 

-  0.1269 

1.816 

10 

6 

-  0.016 

-  13.79 

0.4834 

-  0.1313 

1.813 

10 

7 

0.192 

-  12.61 

0.4858 

-  0.1335 

1.811 

11 

8 

0.386 

-  11.55 

0.4882 

-  0.1356 

1.809 

11 

9 

0.568 

-  10.57 

0.4905 

-  0.1377 

1.807 

11 

10 

6.249 

18.55 

0.5641 

-  0.2018 

1.619 

6 

11 

11.91 

,  31.69 

0.6667 

-  0.3057 

1.565 

6 

12 

18.09 

44.74 

0.7209 

-  0.5642 

1.541 

7 

13 

23.51 

55.87 

0.7490 

-  0.3955 

1.529 

8 

14 

31.49 

72.08 

0.7776 

-  0.4280 

1.519 

8 

15 

37.71 

84.61 

0.7922 

-  0.4448 

1.513 

9 

20 

105.1 

219.0 

0.8517 

-  0.5152 

1.494 

10 

25 

243.0 

493.1 

0.8747 

-  0.5430 

1.488 

11 

30 

534.6 

1072. 

0.8853 

-  0.5559 

1.485 

13 

40 

1772. 

3531. 

0.8909 

-  0.5628 

1.483 

14 

60 

1213. 

2237. 

0.9330 

-  0.6143 

1.421 

14 

True 

0.500 

1.000 

1.5690 

0.0000 

0.0000 

Values 


well,  even  with  noisy  data,  when  more  reasonable  starting  points  are 
provided. 

It  was  not  determined  whether  or  not  the  sequence  of  values  listed 
in  Table  2  would  eventually  converge  to  the  true  parameter  values  or  to 
some  secondary  minimum  in  parameter  space.  Since  the  parameter  estimation 
procedure  was  unconstrained  in  this  experiment,  the  possibility  even 
exists  that  the  sequence  in  c  might  grow  without  bound  and  fail  to 
converge  altogether  even  though  0  were  to  approach  a  limiting  value. 

The  experiment  was  discontinued  after  60  iterations  because  the  cost  of 
computer  time  seemed  to  exceed  the  value  of  the  information  which  would 
probably  be  obtained  from  further  computer  runs. 

The  poor  terminal  behavior  of  the  iteration  procedure  in  this  case 
is  explained  in  part  by  a  consideration  of  the  angle  existing  between  the 
Gauss-Newton  parameter  change  vector,  0  ,  and  the  gradient  of  0  .  At 
iteration  number  60,  the  values  obtained  for  these  two  vectors  were 


1.108  X  10 


2.033  X  10 

4.027 
.074 


4.027  J 

^-5.074  J 


f 


-  1.159  X  10"^ 
6.286  X  10“^ 

-  1.849 

-  2.389 


\ 


The  cosine  of  the  angle  existing  between  these  vectors  may  be  obtained 
by  normalizing  the  scalar  produce;  i.e. 


cos  0  .  t  *'?-  -  2.2  X  lO"”^ 

ipn-j0i 


(145) 
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This  result  shows  that  the  angle  between  the  contour  lines  of  0  and 
the  B  vector  is  only  about  2  x  lO”  radians;  i.e.,  ^  points  "down¬ 
hill"  only  very  slightly.  As  a  consequence.  0  is  reduced  only  when 
?  ia  diminished  by  a  very  large  scale  factor  (2^^  for  the  optimum  step 
is  this  case). 

6.3  Convergence  with  Measurement  Noise. 

To  simulate  measurement  errors,  ten  uncorrelated  samples  of  a 
unit  variance,  zero  mean  Gaussian  random  process  were  generated  by  the 
digital  computer.  These  errors  were  then  scaled  to  several  different 
levels  and  added  to  the  data  points  of  Figure  5.  The  resulting  noisy 
response  data  is  plotted  on  this  figure  for  a  scale  factor,  a  ,  equal  to 
one-tenth  Figure  5  also  shows  the  response  fitted  to  these  data  points 
by  the  computer. 

Table  3  summarizes  the  results  obtained  by  using  different  values 
of  a  in  this  experiment.  The  variable  M  in  the  table  stands  for  the 
number  cycles  of  iteration  required  to  satisfy  the  error  criterion 

V  ' 

Tlie  starting  vector  used  in  each  case  was  as  given  by  equation  (144). 

The  iteration  converged  without  scale  factor  adjustment  for  values  of 
a  less  than  n  »  .4.  For  7  =  .4  and  .5  ,  the  modified  iteration 
procedure  converged  but  the  computer  sometimes  found  it  necessary  to 

Since  the  basic  process  sampled  by  the  computer  possesses  a  unit 
variance,  the  scale  factor,  7  ,  is  just  the  standard  deviation  of 
the  scaled  process. 
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reduce  P  by  a  factor  of  two.  The  drastic  scale  factor  adjustment 
encountered  in  the  computation  leading  to  Table  2  did  not  occur  in  this 
experiment  since  the  starting  parameter  vector  was  reasonably  close  to 
the  final  minimizing  vector.  The  apparently  anomalous  behavior  of  the 
variable  M  for  o  =  .4  and  .5  is  caused  by  the  fact  that  the  parameter 
adjusting  steps  are  more  effective  when  the  search  algorithm  is  evoked. 


TABLE  3 


PARAMETER  ESTIMATES  OBTAINED  FROM  NOISY  DATA 


a 

*=1 

®2 

*^4 

0 

M 

.0 

.50000021 

1.0000002 

1.5690004 

.00000068 

1.085  X  10"^5 

2 

.1 

.4682 

.9542 

1.653 

.06502 

.0568 

6 

.2 

.4502 

.9265 

1.782 

.05652 

.2290 

11 

.3 

.4387 

.9064 

1.945 

.0092 

.5180 

23 

.4 

.4319 

.8905 

2.143 

.1371 

.9240 

11 

.5 

.4289 

.8766 

2.580 

.3369 

1.4470 

10 

True 

Values 

.50000000 

1.00000000 

1.5690000 

.00000000 

0 

- 

This  experiment  shows  that  very  large  errors  may  exist  in  data 
points  without  upsetting  the  convergence  of  the  modified  Sauss-Newton 
procedure  when  the  iteration  is  started  sufficiently  close  to  the  final 

value. 
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6,4  Effect  of  Length  of  Record  on  Estimation  Accuracy 

The  ixnearized  error  tneory  developed  in  section  h.l  requires  that 
the  matrix  of  second  partial  derivatives,  D  ,  be  computed.  As  has  been 
pointed  out,  this  is  a  formidable  task  in  all  but  the  very  simplest  of 
situations.  However,  as  the  previous  experiment  shows,  the  convergence 
of  the  unmodified  Gauss-Newton  procedure  turns  out  to  be  quite  rapid 
even  in  the  presence  of  modest  amounts  of  noise  (o  <  .1).  This  being  the 
case,  the  results  of  Theorem  1  and  2  seem  to  indicate  that  the  matrix  D 
is  probably  quite  insignificant  in  comparison  to  the  regression  matrix,  S. 
When  this  is  in  fact  true,  the  linearized  estimate  of  the  variability  of 
parameter  estimates  is  given  by  equation  (111) 

[cow  6c]  = 

To  check  the  applicability  of  equation  (111),  a  total  of  thirty  statistical 
experiments  were  conducted.  A  data  vector  was  *enerated  for  each  experiment 
by  producing  a  Gaussian  error  vector  from  a  random  number  generation  routine 
and  adding  this  to  the  standard  correct  solution  to  the  system  equation. 

The  parameter  vector  used  to  generate  the  basic  solution  was 


(146) 


,0 


The  thirty  cooputer  runs  were  broiten  into  three  groups  of  ten  each  with 
the  data  points  collected  over  10  seconds  of  simulated  operation  in  the 
first  group,  30  seconds  in  the  second  group  and  100  seconds  in  the  third 
group.  Samples  were  taken  at  one  second  intervals  in  every  case  and 
uncorrelated  Gaussian  noise  with  'S  s  .1  was  added  to  each  set  of  data 
points. 

Table  4  summarizes  the  results  of  this  experiment.  The  theoretical 
standard  deviations  appearing  in  this  table  were  computed  using  the 
approximate  formula  given  fay  equation  (111).  The  experimental  standard 
deviations  represent  the  observed  r.m.s.  deviation  of  the  sample  values 
from  the  true  value  for  each  parameter. 

TABLE  V 

EFFECT  OF  LENGTH  OF  RECORD  ON  ESTIMATION  ACCURACY 


Parameter 

Number  of 
Response 
Samples 

Average 

True  Value  Experimental 
Value 

Theoretical 

Standard 

Deviation 

Experimental 

Standard 

Deviation 

‘^l 

10 

0.0500 

0.0534 

.0355 

.0407 

10 

1.0000 

1.0095 

.0391 

.0268 

10 

1.0000 

1.0422 

.1181 

.1424 

10 

0.0000 

0.0529 

.1028 

.1138 

30 

0.0500 

0.0516 

.00868 

.00350 

«2 

30 

1.0000 

0.9995 

.01028 

.00741 

®3 

30 

1.0000 

1.0071 

.0674 

.0469 

®4 

30 

0.0000 

0.0174 

.0633 

.0340 

100 

0.0500 

0.0493 

.00375 

.00279 

®2 

100 

1.0000 

0.9995 

.00440 

.00378 

®3 

100 

1.0000 

0.9777 

.0487 

.0436 

100 

0.0000 

0.0183 

.0520 

.0453 
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The  agreement  between  the  theoretical  and  observed  quantities  In 
this  table  appears  to  support  the  analysis  carried  out  in  this  report. 
The  Sauss-Newton  iteration  procedure  converged  without  searching  for 
every  one  of  the  thirty  computer  runs.  The  rms  error  in  parameter 
eatimation  seems  to  decrease  roughly  in  proportion  to  the  square  root 
of  the  number  of  samples,  as  predicted  by  theory.  The  experimental 
standard  deviations  are  well  predicted  by  equation  (111)  in  every  case. 
This  result  lends  credence  to  the  hypothesis  that  the  D  matrix  may 
commonly  be  ignored  in  computing  rough  error  estimates  in  situations 
where  the  unmodified  Gauss-Newton  procedure  converges. 

7.  SUMMARY  AND  CONCLUSIONS 

This  study  shows  that  the  methods  of  linear  regression  analysis 
may  be  extended  to  cover  the  estimation  of  parameters  for  nonlinear 
dynamic  systems.  The  minimization  of  the  sum  squared  error  criterion 
function  may  be  accomplished  by  treating  the  nonlinear  problem  as  a 
succession  of  linear  problems.  The  local  linearization  required  can 
be  achieved  by  solving  the  auxiliary  parameter  influence  differential 
equations.  The  necessary  and  sufficient  conditions  for  the  convergence 
of  this  process  involve  the  eigenvalues  of  a  particular  matrix  of  first 
and  second  order  partial  derivatives  of  the  nonlinear  system  response 
with  respect  to  the  unknown  parameters. 

A  complete  parameter  estimation  algorithm  possessing  universal 
stability  may  be  constructed  by  adding  an  automatic  scale  factor 
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•djusting  feature  to  the  iterated  linear  estinatlon  procedure.  This 
algorithm  produces  not  only  parameter  estimates,  but  also  an  approximate 
evaluation  of  the  accuracy  of  the  estimates.  The  results  obtained  by 
utilizing  this  method  in  experiments  involving  the  response  of  a  pendulum 
are  entirely  in  agreement  with  theory  and  establish  iterated  linear 
regression  analysis  as  a  practical  tool  for  the  Identification  of  non¬ 
linear  dynamic  systems. 

Certain  problems  remain  unresolved.  While  it  always  converges,  the 
modified  Gauss-Newton  or  iterated  linear  regression  method  is  really 
effective  only  when  initial  parameter  estimates  are  not  grossly  in  error. 
Alternative  procedures  are  needed  for  situations  in  which  good  initial 
estimates  are  not  available.  The  problem  of  constrained  parameter 
estimation  has  not  been  treated.  The  possible  existence  of  multiple 
minima  of  the  criterion  function  has  likewise  been  ignored.  All  of  these 
problems  will  be  made  the  subject  of  further  research  and  reports  will 
be  submitted  whenever  significant  progress  has  been  made. 
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